In this Jupyter notebook I have the honor to present a guide on how to fit data of different kinds using the Python package $\texttt{PyMC}$. I will discuss the advantages this package offers for Bayesian inference and how Bayesian fits differ from algebraic or frequentist adjustments.
Most of the notebook was developed to solve the exercises suggested in Hogg et al. 2010, an excellent introduction to Bayesian inference and data fitting.
At the very end of the notebook there are two application examples with real data, which I truly hope illustrate the usefulness of this tool across the whole range of astrophysical models.
Differences between $\texttt{emcee}$ and $\texttt{PyMC}$¶
Both $\texttt{emcee}$ and $\texttt{PyMC}$ are $\textit{Python}$ libraries designed to tackle problems related to Markov Chain Monte Carlo (MCMC) sampling in Bayesian statistics. Although both have a common purpose, they have some key differences in terms of implementation and features. Some highlighted differences are shown below:
- First of all, regarding syntax, whose importance should not be underestimated, $\texttt{PyMC}$ provides a model-oriented syntax in which prior and likelihood distributions can be defined more intuitively.
- Regarding the level of abstraction, $\texttt{PyMC}$ is a higher-level library that focuses on modeling Bayesian problems; its declarative syntax for specifying probabilistic models makes $\texttt{PyMC}$ more accessible.
- Some relevant features of $\texttt{PyMC}$ are its wide range of sampling techniques, including MCMC, and its reliance on $\texttt{PyTensor}$ (the successor of $\texttt{Theano}$) for enhanced computational efficiency. On the other hand, $\texttt{emcee}$, as far as I know, focuses on a single MCMC method, the affine-invariant Ensemble Sampler.
- $\texttt{PyMC}$ has a larger and more active community and integrates with dedicated visualization tools (for example $\texttt{arviz}$), which newcomers always appreciate.
Some examples to visualize the main differences between both libraries:
import numpy as np
import emcee

# define model and probability functions
def log_likelihood(theta):
    # likelihood (placeholder: return the log-likelihood of the data)
    return 0.0

def log_prior(theta):
    # prior (placeholder: return the log-prior of the parameters)
    return 0.0

def log_posterior(theta):
    return log_prior(theta) + log_likelihood(theta)

# set up the sampler
ndim = 2        # number of dimensions (parameters)
nwalkers = 100  # number of walkers in the ensemble
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)

# run the MCMC
initial_positions = np.random.rand(nwalkers, ndim)
sampler.run_mcmc(initial_positions, 1000)
import pymc as pm

# define model
with pm.Model() as model:
    # priors
    theta = pm.Uniform('theta', 0, 1, shape=2)  # uniform prior
    # likelihood: use a distribution appropriate for the data at hand,
    # e.g. a Bernoulli likelihood for binary observations
    likelihood = pm.Bernoulli('likelihood', p=theta, observed=data)

# sampling with PyMC
with model:
    trace = pm.sample(1000)
Brief introduction to the sampling process in $\texttt{PyMC}$¶
In Bayesian statistics, we are often interested in estimating the posterior distribution of model parameters given observed data. The posterior distribution, as you surely know, is proportional to the product of the likelihood function and the prior distribution. However, for complex models (I would dare to say practically all of them), obtaining the exact form of the posterior distribution is analytically intractable. This is where sampling methods come into play. In the context of Bayesian statistics, sampling refers to the process of generating a representative set of samples from the posterior distribution of the model parameters.
In terms of the sampling process, MCMC (Markov Chain Monte Carlo) is a powerful class of algorithms for sampling from complex probability distributions. The basic idea is to construct a sequence of random variables in which the distribution of each variable depends only on the previous one (a Markov Chain), in such a way that, when run for a sufficiently long time, the samples generated by the chain converge to the desired posterior distribution. A Markov Chain is a mathematical model for a system that undergoes probabilistic transitions between a series of states; the probability of a transition depends only on the current state, regardless of how the system evolved to reach it. Under the ergodicity property, the Markov Chain eventually visits every state with non-zero probability, and its long-term behavior is independent of the initial state. The basic MCMC algorithm starts from an initial parameter value and proposes a new one drawn from a proposal distribution. Then, according to the probability given by the priors and the likelihood, the proposed value is accepted or rejected. This step is repeated over $N$ iterations until the sequence of accepted parameter values forms an approximate sample from the posterior distribution.
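The accept/reject loop described above can be sketched with a minimal random-walk Metropolis sampler. This is an illustrative sketch, not what $\texttt{PyMC}$ runs internally; the target density (a standard normal) and the step size are arbitrary choices of mine:

```python
import numpy as np

def metropolis(log_posterior, theta0, n_samples=5000, step=0.5, seed=0):
    """Minimal random-walk Metropolis sampler (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta, log_p = theta0, log_posterior(theta0)
    samples = []
    for _ in range(n_samples):
        # propose a new value from a symmetric Gaussian proposal
        proposal = theta + step * rng.normal()
        log_p_new = log_posterior(proposal)
        # accept with probability min(1, p_new / p_current)
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return np.array(samples)

# toy target: a standard normal posterior, log p(theta) = -theta^2/2 + const
samples = metropolis(lambda t: -0.5 * t**2, theta0=0.0)
print(samples.mean(), samples.std())
```

After enough iterations the histogram of `samples` approximates the target density, which is exactly the convergence property the text describes.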
Regarding default $\texttt{PyMC}$ algorithms, the default method is NUTS (No-U-Turn Sampler), an advanced algorithm popular for its efficiency and adaptability. It is particularly well suited to complex, high-dimensional parameter spaces. NUTS adaptively tunes the step size during the tuning phase, reducing the need for manual adjustment, and it stops each trajectory when it detects a U-turn in parameter space, preventing unnecessary exploration.
For all that, the $\texttt{Model()}$ context in $\texttt{PyMC}$ is used to specify and define a Bayesian probabilistic model, in which all priors, likelihoods and distributions are declared before sampling with $\texttt{sample()}$. $\texttt{PyMC}$ automatically selects an appropriate sampler based on the characteristics of the model.
Part 0: import all packages needed¶
import numpy as np
import pandas as pd
import scipy as sp
import matplotlib.pyplot as plt
import matplotlib.ticker
import matplotlib.colors as mcolors
from matplotlib.patches import Ellipse
from mpl_toolkits.axes_grid1 import make_axes_locatable
import arviz as az
import pymc as pm
import bambi as bmb
import xarray as xr
import emcee
import tqdm
import corner
from scipy.optimize import curve_fit, minimize
import scipy.integrate as spi
from scipy.stats import gaussian_kde
from pymc import HalfCauchy, Model, Normal, sample, Mixture, Beta, HalfNormal, Uniform, TruncatedNormal
import warnings
print(f"Running on PyMC v{pm.__version__}")
print(f"Running on SciPy v{sp.__version__}")
print(f"Running on ArViz v{az.__version__}")
print(f"Running on Bambi v{bmb.__version__}")
print(f"Running on xarray v{xr.__version__}")
print(f"Running on corner v{corner.__version__}")
print(f"Running on NumPy v{np.__version__}")
print(f"Running on emcee v{emcee.__version__}")
print(f"Running on tqdm v{tqdm.__version__}")
Running on PyMC v5.10.3
Running on SciPy v1.12.0
Running on ArViz v0.17.0
Running on Bambi v0.13.0
Running on xarray v2024.1.1
Running on corner v2.2.3
Running on NumPy v1.26.3
Running on emcee v3.1.6
Running on tqdm v4.67.1
az.style.use('default')
plt.rcParams.update({
    'xtick.direction': 'inout',  # ticks point both inward and outward on the x axis
    'ytick.direction': 'inout',  # ticks point both inward and outward on the y axis
    'xtick.bottom': True,
    'ytick.left': True,
    'xtick.top': False,   # no ticks on the top spine
    'ytick.right': False  # no ticks on the right spine
})
plt.rcParams.update({'font.size': 11, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
plt.rcParams['font.family'] = 'serif'
All classes used in this jupyter notebook¶
class algebraic_method:
    """
    Fit a linear regression using (weighted) least squares.

    Parameters:
    - x: Vector of independent data.
    - y: Vector of dependent data.
    - sigma_y: Vector of errors in y.
    """
    def __init__(self, x, y, sigma_y):
        self.x = x
        self.y = y
        self.sigma_y = sigma_y

    def matrix_weighted_least_squares_fit(self):
        """ Fit a weighted linear regression using standard weighted least squares.
        Returns X_w, the vector of regression coefficients (b, m). """
        # calculate the required sums
        sum_w = np.sum(1 / self.sigma_y**2)
        sum_wx = np.sum(self.x / self.sigma_y**2)
        sum_wy = np.sum(self.y / self.sigma_y**2)
        sum_wxx = np.sum(self.x**2 / self.sigma_y**2)
        sum_wxy = np.sum(self.x * self.y / self.sigma_y**2)
        # calculate the coefficients of the weighted linear regression
        delta = sum_w * sum_wxx - sum_wx**2
        b = (sum_wxx * sum_wy - sum_wx * sum_wxy) / delta
        m = (sum_w * sum_wxy - sum_wx * sum_wy) / delta
        X_w = np.array([b, m])
        return X_w
def matrix(self):
""" Compute the linear regression matrix system. """
# create matrix A with one column of 1s and another column with the data 'x'
A = np.column_stack((np.ones_like(self.x), self.x))
# convert 'y' to a one-dimensional array
Y = np.array(self.y)
# create the diagonal matrix C
C = np.diag(self.sigma_y**2)
# get A^T
A_T = np.transpose(A)
# get C^-1
C_inv = np.linalg.inv(C)
# get A^T C^-1
A_T_C_inv = np.dot(A_T, C_inv)
# get A^T C^-1 A
A_T_C_inv_A = np.dot(A_T_C_inv, A)
        # get the inverse of A^T C^-1 A
A_T_C_inv_A_inv = np.linalg.inv(A_T_C_inv_A)
# get A^T C^-1 Y
A_T_C_inv_Y = np.dot(A_T_C_inv, Y)
# get X
X = np.dot(A_T_C_inv_A_inv, A_T_C_inv_Y)
return X, A, Y, C
def get_results(self, method, print_str=True):
""" Compute regression statistics and display the results. """
if method == 'OLS':
X, A, Y, C = self.matrix()
elif method == 'WLS':
X = self.matrix_weighted_least_squares_fit()
A = np.column_stack((np.ones_like(self.x), self.x))
Y = np.array(self.y)
C = np.diag(self.sigma_y**2)
else:
raise ValueError("Invalid method; choose 'OLS' or 'WLS'.")
# predicted values
y_hat = np.dot(A, X)
# residual variance (degrees of freedom: N - 2)
sigma_squared = np.sum((Y - y_hat) ** 2) / (len(Y) - 2)
# parameters variance
x_bar = np.mean(A[:, 1])
var_m = sigma_squared / np.sum((A[:, 1] - x_bar) ** 2)
var_b = sigma_squared * (1 / len(Y) + x_bar ** 2 / np.sum((A[:, 1] - x_bar) ** 2))
# covariance matrix
cov_matrix = sigma_squared * np.array([
[np.sum(A[:, 1] ** 2) / (len(Y) * np.sum((A[:, 1] - x_bar) ** 2)), -x_bar / np.sum((A[:, 1] - x_bar) ** 2)],
[-x_bar / np.sum((A[:, 1] - x_bar) ** 2), 1 / np.sum((A[:, 1] - x_bar) ** 2)]
])
# display results
        if print_str:
print(f'**{method}**')
print('Residual variance:', sigma_squared)
print('sigma_m:', np.sqrt(var_m))
print('sigma_b:', np.sqrt(var_b))
print('Covariance matrix:')
print(cov_matrix)
print('X matrix:', X)
print(f'y = ({X[1]} ± {np.sqrt(var_m) / np.sqrt(len(Y))}) x + ({X[0]} ± {np.sqrt(var_b) / np.sqrt(len(Y))})')
return X, A, Y, C, np.sqrt(var_m), np.sqrt(var_b)
def plot_results(self, color_OLS, color_WLS, reduced_string, fig=None, ax=None):
""" Plot the regression results with error bars and fitted lines. """
# obtain results
X, A, Y, C, sigma_m, sigma_b = self.get_results('OLS', False)
X_w, A_w, Y_w, C_w, sigma_m_w, sigma_b_w = self.get_results('WLS', False)
# fit values prediction
y_hat = np.dot(A, X)
y_hat_w = np.dot(A_w, X_w)
# parameters variance
x_bar = np.mean(A[:, 1])
sigma_squared = np.sum((Y - y_hat) ** 2) / (len(Y) - 2)
var_m = sigma_squared / np.sum((A[:, 1] - x_bar) ** 2)
var_b = sigma_squared * (1 / len(Y) + x_bar ** 2 / np.sum((A[:, 1] - x_bar) ** 2))
x_bar_w = np.mean(A_w[:, 1])
sigma_squared_w = np.sum((Y_w - y_hat_w) ** 2) / (len(Y_w) - 2)
var_m_w = sigma_squared_w / np.sum((A_w[:, 1] - x_bar_w) ** 2)
var_b_w = sigma_squared_w * (1 / len(Y_w) + x_bar_w ** 2 / np.sum((A_w[:, 1] - x_bar_w) ** 2))
# create figure
        if fig is None and ax is None:
            fig, ax = plt.subplots(1, 1, figsize=(8, 6))
            rep = False
            delta = 0
        else:
            rep = True
            delta = 45
# create values range
x_array = np.linspace(min(self.x), max(self.x), 1000)
if reduced_string == True:
reduced = ' (reduced data)'
else:
reduced = ''
# plot fit regressions
        ax.plot(x_array, X_w[1] * x_array + X_w[0], linewidth=1, label=f"WLS Fit{reduced}", color=color_WLS)
        ax.plot(x_array, X[1] * x_array + X[0], linewidth=1, label=f"OLS Fit{reduced}", color=color_OLS)
# labels
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
# show equations
if rep == True:
label_data = None
label_predic = None
color_predic = color_OLS
else:
label_data = 'Data (outliers)'
label_predic = r'$(x_i,\,\hat{y})$'
color_predic = color_OLS
# plot errorbars
ax.errorbar(self.x, self.y, yerr=self.sigma_y, fmt='o', label=label_data, capsize=5, markersize=5)
# predictions
ax.scatter(self.x, y_hat, marker='+', color=color_predic, label=label_predic)
        ax.text(50, min(self.y) - 105 - delta,
                rf'$y = ({round(X[1], 15)} \pm {round(np.sqrt(var_m) / np.sqrt(len(Y)), 15)}) x + ({round(X[0], 15)} \pm {round(np.sqrt(var_b) / np.sqrt(len(Y)), 15)})$',
                fontsize=8, color=color_OLS)
        ax.text(50, min(self.y) - 125 - delta,
                rf'$y = ({round(X_w[1], 15)} \pm {round(np.sqrt(var_m_w) / np.sqrt(len(Y_w)), 15)}) x + ({round(X_w[0], 15)} \pm {round(np.sqrt(var_b_w) / np.sqrt(len(Y_w)), 15)})$',
                fontsize=8, color=color_WLS)
ax.grid(False)
ax.legend()
return fig, ax
    def reduced_data(self):
        """ Return the data with the first four points (the outliers) removed. """
        x_red = np.array(self.x[4:20])
        y_red = np.array(self.y[4:20])
        sigma_red = np.array(self.sigma_y[4:20])
        return x_red, y_red, sigma_red
class GeometricRegression:
def __init__(self, sigma_x, sigma_y, rho_xy, aspect_ratio):
# initialize the object with the required parameters
self.sigma_x = sigma_x
self.sigma_y = sigma_y
self.rho_xy = rho_xy
self.aspect_ratio = aspect_ratio
def rotation(self, theta, aspect_ratio, sigma_x, sigma_y, rho_xy):
# function to calculate clockwise rotation for ellipses
return (1 / 2) * np.arctan2((2 * rho_xy * sigma_x * sigma_y), (sigma_x**2 - sigma_y**2) * aspect_ratio)
def ellipse_axes(self, confidence_interval=1):
# function to calculate axis lengths for ellipses
sigma_xy = self.rho_xy * self.sigma_x * self.sigma_y
cov_matrix = np.array([[self.sigma_x**2, sigma_xy], [sigma_xy, self.sigma_y**2]])
# eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
# order eigenvalues and eigenvectors
order = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
# axis lengths
major_axis = 2 * np.sqrt(confidence_interval * eigenvalues[0])
minor_axis = 2 * np.sqrt(confidence_interval * eigenvalues[1])
# get rotation
theta = self.rotation(0, self.aspect_ratio, self.sigma_x, self.sigma_y, self.rho_xy)
return major_axis, minor_axis, theta
class LuminosityPeriodPlotter:
"""
A class for generating a scatter plot relating the logarithm of luminosity with the logarithm of the period using input data.
Attributes:
data (dict): A dictionary containing the data to plot. Keys should match the column names in the dataset.
"""
def __init__(self, data):
"""
Initializes the LuminosityPeriodPlotter with the provided data.
Args:
data (dict): A dictionary containing the data to plot.
"""
self.data = data
def plot(self):
"""
Generates the scatter plot of luminosity versus period.
Returns:
matplotlib.figure.Figure: The generated scatter plot figure.
"""
# Initialize lists to store filtered data
data_P_d = []
L_x_L_Lsol_array = []
n_array = []
data_P_d_new = []
L_x_L_Lsol_array_new = []
n_array_new = []
# Filter data where P_d is not 0
for i in range(len(self.data['P_d'])):
if self.data['P_d'][i] != 0:
P_d = self.data['P_d'][i]
data_P_d.append(P_d)
L_x_L_Lsol = self.data['L_x/L_Lsol'][i]
L_x_L_Lsol_array.append(L_x_L_Lsol)
n = self.data['SpTnum_1'][i]
n_array.append(n)
# Filter data where P_d is not 0
counts = [1 for i in range(len(self.data['P_d'])) if self.data['P_d'][i] != 0]
print('w/ periods:', len(counts))
# Filter data where luminosity is not infinity
counts = [1 for i in range(len(L_x_L_Lsol_array)) if L_x_L_Lsol_array[i] != np.inf]
print('w/ luminosities and periods:', len(counts))
# Replace NaN in n_array with np.inf
j = 0
for i in range(len(n_array)):
            if np.isnan(n_array[i]):
                n_array[i] = np.inf
                j += 1
print('w/ spectral type:', len(n_array) - j)
# Convert to numpy arrays
data_P_d = np.array(data_P_d)
L_x_L_Lsol_array = np.array(L_x_L_Lsol_array)
n_array = np.array(n_array)
# Filter data based on luminosity
for i in range(len(L_x_L_Lsol_array)):
if L_x_L_Lsol_array[i] != np.inf:
P_d = data_P_d[i]
data_P_d_new.append(P_d)
L_x_L_Lsol = L_x_L_Lsol_array[i]
L_x_L_Lsol_array_new.append(L_x_L_Lsol)
n = n_array[i]
n_array_new.append(n)
# Convert to numpy arrays
data_P_d_new = np.array(data_P_d_new)
L_x_L_Lsol_array_new = np.array(L_x_L_Lsol_array_new)
n_array_new = np.array(n_array_new)
# Calculate the maximum value of n_array_new
n_array_max_new = max(n_array_new)
# Calculate the log10 of the data
log_data_P_d = np.log10(data_P_d_new)
log_L_x_L_Lsol_array = np.log10(L_x_L_Lsol_array_new)
# Create the scatter plot
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
# Use a reversed colormap
colormap = plt.cm.hot
colormap = colormap.reversed()
# Normalize according to the values of n_array_new
normalize = plt.Normalize(vmin=min(n_array_new), vmax=n_array_max_new)
# Create the scatter plot
im = ax.scatter(log_data_P_d, log_L_x_L_Lsol_array, c=n_array_new, s=20, cmap=colormap, marker='o', zorder=1)
# Add the colorbar
divider = make_axes_locatable(ax)
cax = divider.append_axes('top', size='5%', pad=0.05)
fig.colorbar(im, cax=cax, orientation='horizontal').ax.set_xlabel('$n$, M$n$V')
cax.xaxis.set_ticks_position('top')
cax.xaxis.set_label_position('top')
# Set axis labels
        ax.set_ylabel(r'$\log{(\mathcal{L}_x/\mathcal{L}_{bol})}$')
        ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
        ax.set_xlabel(r'$\log{(\mathcal{P}\, [d])}$')
return fig, log_data_P_d, log_L_x_L_Lsol_array, n_array_new, n_array_max_new
class SaturationModel:
def __init__(self, x_data, y_data):
self.x_data = x_data
self.y_data = y_data
def build_model(self):
with pm.Model() as model_saturation:
# priors
log_P_sat = pm.Normal('log_P_sat', mu=0.5, sigma=0.1)
m = pm.Normal('m', mu=-1.4, sigma=1)
C = pm.Normal('C', mu=-0.0075, sigma=0.1)
sigma_log_L = pm.HalfNormal('sigma_log_L', sigma=1)
# constrain log_P_sat * m + C to be equal to log_L_sat
log_L_sat_constrained = pm.Deterministic('log_L_sat', log_P_sat * m + C)
V_b = pm.HalfNormal('V_b', sigma=1)
log_L_b = pm.Normal('log_L_b', mu=-3.5, sigma=1)
P_b = pm.Beta('P_b', alpha=2, beta=5)
# divide data into two ensembles
idx_saturation = pm.math.switch(self.x_data > log_P_sat, 0, 1)
# right log_P_sat
mu_right = pm.math.switch(idx_saturation, m * self.x_data + C, log_L_b)
# mixture term for in and outliers
inlier_likelihood = pm.Normal.dist(mu=mu_right, sigma=sigma_log_L)
outlier_likelihood = pm.Normal.dist(mu=log_L_b, sigma=pm.math.sqrt(sigma_log_L**2 + V_b))
likelihood_right = pm.Mixture('likelihood_right', w=[P_b, 1 - P_b], comp_dists=[inlier_likelihood, outlier_likelihood], observed=self.y_data)
# left log_P_sat saturation regime
mu_saturation = pm.math.switch(self.x_data <= log_P_sat, log_L_sat_constrained, m * self.x_data + C)
likelihood_saturation = pm.Normal('likelihood_saturation', mu=mu_saturation, sigma=sigma_log_L, observed=self.y_data)
# sampling
trace_saturation = pm.sample(1000, tune=1000, step=[pm.NUTS(target_accept=0.9, max_treedepth=40)], chains=4)
return trace_saturation, model_saturation
Part 1: how to fit data to a model (classical way)¶
We begin with an example of fitting data to a given model. In this case, it involves a simple linear regression with data for x, y, and error in y.
Firstly, we set our test data as a dataframe for easier handling.
This dataset consists of $x$, $y$ data; $\sigma_y$ and $\sigma_x$, the standard uncertainties of each variable; and $\rho_{xy}$, the correlation coefficient between both variables (Hogg et al., 2010).
data = [[201,244,47,287,203,58,210,202,198,158,165,201,157,131,166,160,186,125,218,146],
[592,401,583,402,495,173,479,504,510,416,393,442,317,311,400,337,423,334,533,344],
[61,25,38,15,21,15,27,14,30,16,14,25,52,16,34,31,42,26,16,22],
[9, 4, 11, 7, 5, 9, 4, 4, 11, 7, 5, 5, 5, 6, 6, 5, 9, 8, 6, 5],
[-0.84, 0.31, 0.64, -0.27, -0.33, 0.67, -0.02, -0.05, -0.84, -0.69, 0.3, -0.46, -0.03, 0.5, 0.73, -0.52, 0.9, 0.4, -0.78, -0.56]]
columns = [r'$x$', r'$y$', r'$\sigma_y$', r'$\sigma_x$', r'$\rho_{xy}$']
table_data = pd.DataFrame(data).T
table_data.columns = columns
table_data
| $x$ | $y$ | $\sigma_y$ | $\sigma_x$ | $\rho_{xy}$ | |
|---|---|---|---|---|---|
| 0 | 201.0 | 592.0 | 61.0 | 9.0 | -0.84 |
| 1 | 244.0 | 401.0 | 25.0 | 4.0 | 0.31 |
| 2 | 47.0 | 583.0 | 38.0 | 11.0 | 0.64 |
| 3 | 287.0 | 402.0 | 15.0 | 7.0 | -0.27 |
| 4 | 203.0 | 495.0 | 21.0 | 5.0 | -0.33 |
| 5 | 58.0 | 173.0 | 15.0 | 9.0 | 0.67 |
| 6 | 210.0 | 479.0 | 27.0 | 4.0 | -0.02 |
| 7 | 202.0 | 504.0 | 14.0 | 4.0 | -0.05 |
| 8 | 198.0 | 510.0 | 30.0 | 11.0 | -0.84 |
| 9 | 158.0 | 416.0 | 16.0 | 7.0 | -0.69 |
| 10 | 165.0 | 393.0 | 14.0 | 5.0 | 0.30 |
| 11 | 201.0 | 442.0 | 25.0 | 5.0 | -0.46 |
| 12 | 157.0 | 317.0 | 52.0 | 5.0 | -0.03 |
| 13 | 131.0 | 311.0 | 16.0 | 6.0 | 0.50 |
| 14 | 166.0 | 400.0 | 34.0 | 6.0 | 0.73 |
| 15 | 160.0 | 337.0 | 31.0 | 5.0 | -0.52 |
| 16 | 186.0 | 423.0 | 42.0 | 9.0 | 0.90 |
| 17 | 125.0 | 334.0 | 26.0 | 8.0 | 0.40 |
| 18 | 218.0 | 533.0 | 16.0 | 6.0 | -0.78 |
| 19 | 146.0 | 344.0 | 22.0 | 5.0 | -0.56 |
We can see all data displayed in a 2D plot...
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
x_column = table_data[r'$x$'].values
y_column = table_data[r'$y$'].values
sigma_y_column = table_data[r'$\sigma_y$'].values
ax.errorbar(x_column, y_column, yerr=sigma_y_column, fmt='o', label='Data', capsize=5, markersize=5, alpha=0.5, zorder=0)
ax.scatter(x_column, y_column, color='blue', s=15, zorder=1)
x_array = np.linspace(40, 300, 1000)
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.grid(False)
In linear regression, we typically model the relationship between a dependent variable $Y$ and an independent variable $x$ with the following linear equation:
\begin{equation} Y = A X + \varepsilon \end{equation}
where $Y$ is the vector of observed values, $A$ is the design matrix (a column of ones and a column with the $x$ data), $X$ is the vector of coefficients (intercept $b$ and slope $m$), and $\varepsilon$ is the vector of residuals (or errors) between the observed and predicted values.
We then have several methods to fit the linear regression to the data, such as ordinary least squares or weighted least squares (Appendix A).
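In matrix form, the weighted least-squares estimate implemented by the class above is $X = (A^T C^{-1} A)^{-1} A^T C^{-1} Y$, and it can be computed directly with NumPy. A minimal sketch with made-up toy data (the values below are mine, not the Hogg et al. dataset):

```python
import numpy as np

# toy data, roughly following y = 2x (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma_y = np.array([0.2, 0.1, 0.3, 0.2, 0.1])

A = np.column_stack((np.ones_like(x), x))  # design matrix [1, x]
C_inv = np.diag(1.0 / sigma_y**2)          # inverse covariance of y

# X = (A^T C^-1 A)^-1 A^T C^-1 Y, solved without forming the explicit inverse
lhs = A.T @ C_inv @ A
rhs = A.T @ C_inv @ y
b, m = np.linalg.solve(lhs, rhs)
print(f"intercept b = {b:.3f}, slope m = {m:.3f}")
```

Using `np.linalg.solve` on the normal equations is numerically preferable to multiplying by `np.linalg.inv` as the class does, although both give the same result here.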
# extract columns
x = table_data[r'$x$'].values
y = table_data[r'$y$'].values
sigma_y = table_data[r'$\sigma_y$'].values
WLS = algebraic_method(x, y, sigma_y).matrix_weighted_least_squares_fit()
OLS = algebraic_method(x, y, sigma_y).matrix()
results_OLS = algebraic_method(x, y, sigma_y).get_results('OLS')
**OLS**
Residual variance: 10817.46255649714
sigma_m: 0.42392020277556824
sigma_b: 76.99801875138671
Covariance matrix:
[[ 5.92869489e+03 -3.11164988e+01]
 [-3.11164988e+01  1.79708338e-01]]
X matrix: [213.27349198   1.07674752]
y = (1.076747524168327 ± 0.09479143904416655) x + (213.27349197596124 ± 17.217280406090417)
results_WLS = algebraic_method(x, y, sigma_y).get_results('WLS')
**WLS**
Residual variance: 10817.462556497163
sigma_m: 0.42392020277556874
sigma_b: 76.9980187513868
Covariance matrix:
[[ 5.92869489e+03 -3.11164988e+01]
 [-3.11164988e+01  1.79708338e-01]]
X matrix: [213.27349198   1.07674752]
y = (1.0767475241683306 ± 0.09479143904416668) x + (213.27349197596038 ± 17.217280406090435)
The first method uses the ordinary least squares (OLS) approach, assuming that all data points are equally weighted. Here, the residual sum of squares is used to calculate the variance of the parameters and the covariance matrix.
The second method, weighted least squares (WLS), is used when data points carry different weights. Each data point is assigned a weight, and those with higher weights (smaller uncertainties) count more in the regression. This method adjusts the variance calculations according to the weights of the observations.
_ = algebraic_method(x, y, sigma_y).plot_results(color_OLS='r', color_WLS='orange', reduced_string=None)
Now, we try to fit the data to the linear regression model but using just the "good" points, excluding the possible outliers.
x_red, y_red, sigma_y_red = algebraic_method(x, y, sigma_y).reduced_data()
results_OLS_red = algebraic_method(x_red, y_red, sigma_y_red).get_results('OLS')
**OLS**
Residual variance: 1034.486575467791
sigma_m: 0.2031879240031288
sigma_b: 35.020383303786794
Covariance matrix:
[[ 1.22642725e+03 -6.92561452e+00]
 [-6.92561452e+00  4.12853325e-02]]
X matrix: [34.04772776  2.23992083]
y = (2.2399208316310975 ± 0.0507969810007822) x + (34.04772775754255 ± 8.755095825946698)
results_WLS_red = algebraic_method(x_red, y_red, sigma_y_red).get_results('WLS')
**WLS**
Residual variance: 1034.4865754677583
sigma_m: 0.20318792400312558
sigma_b: 35.02038330378624
Covariance matrix:
[[ 1.22642725e+03 -6.92561452e+00]
 [-6.92561452e+00  4.12853325e-02]]
X matrix: [34.04772776  2.23992083]
y = (2.2399208316310877 ± 0.050796981000781394) x + (34.04772775754262 ± 8.75509582594656)
fig, ax = algebraic_method(x, y, sigma_y).plot_results(color_OLS='red', color_WLS='orange', reduced_string=None)
_ = algebraic_method(x_red, y_red, sigma_y_red).plot_results(color_OLS='blue', color_WLS='deepskyblue', reduced_string=True, fig=fig, ax=ax)
We see clear differences between both linear regression models. When using all the data, the model is biased by the clear outliers that depart from the trend of the rest of the points; with the reduced data, the model makes much more sense, and the residual variance drops to roughly a tenth of the value obtained with all the data.
Part 2: how to fit data to a model (bayesian way)¶
Part 2.0: introduction to bayesian inference¶
As an introduction (or reminder) to Bayesian inference: Bayesian statistics is a framework for statistical inference in which we update our beliefs about a hypothesis (or model parameters) in the light of new evidence. The key principle of Bayesian statistics is Bayes' theorem, which allows us to calculate the posterior probability of a hypothesis given the observed data.
The posterior distribution, denoted by $P(\theta | D)$, represents our updated belief about the parameter $\theta$ after observing the data $D$. It is computed using Bayes' theorem:
\begin{equation} P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} \end{equation}
Where:
- $P(\theta | D)$ is the posterior probability of the parameter $\theta$ given the data $D$,
- $P(D | \theta)$ is the likelihood of the data given the parameter $\theta$,
- $P(\theta)$ is the prior probability of $\theta$, representing our initial belief about $\theta$ before observing the data,
- $P(D)$ is the marginal likelihood or evidence, which normalizes the result to ensure the posterior is a valid probability distribution.
The likelihood function $\mathcal{L} = P(D | \theta)$ quantifies how probable the observed data is given different values of the parameter $\theta$. The prior distribution $P(\theta)$ represents our knowledge or belief about the parameter before seeing the data. The posterior distribution combines these two pieces of information to provide a new distribution for $\theta$, which incorporates both our prior belief and the observed data.
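A concrete numeric illustration of this update (the numbers are made up): suppose a coin is either biased, with $P(\text{heads}|\text{biased}) = 0.8$, or fair, and we assign each hypothesis a prior of 0.5. Observing a single head updates our belief as follows:

```python
# discrete Bayes update: two hypotheses for a coin, prior 50/50
prior = {"biased": 0.5, "fair": 0.5}
likelihood_heads = {"biased": 0.8, "fair": 0.5}

# evidence P(D): sum of likelihood * prior over all hypotheses
evidence = sum(likelihood_heads[h] * prior[h] for h in prior)

# posterior P(h | D) = P(D | h) P(h) / P(D)
posterior = {h: likelihood_heads[h] * prior[h] / evidence for h in prior}
print(posterior)  # biased ≈ 0.615, fair ≈ 0.385
```

One head is weak evidence, so the posterior moves only modestly toward "biased"; each further observation would repeat the same update with the current posterior as the new prior.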
In practice, bayesian inference is used to estimate parameters, predict future data, or perform model comparison. One of the key features of Bayesian statistics is that it provides a full probability distribution for the parameters, not just a point estimate, which allows us to quantify uncertainty in a principled way.
In bayesian linear regression, we treat the model parameters (slope $m$ and intercept $b$) as random variables with prior distributions. Given a set of data points $(x_i, y_i)$, the likelihood function $\mathcal{L}$ is given by:
\begin{equation} \mathcal{L} = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp{\left(-\frac{(y_i - m x_i - b)^2}{2\sigma_i^2}\right)} = P\left(\{y_i\}^N_{i=1}|m,b,I\right) \end{equation} where $\sigma_i$ represents the standard deviation of the measurement error for each point. This likelihood function quantifies how likely the observed data is for a given set of model parameters $m$ and $b$.
Using Bayes' theorem, we can update our prior beliefs about $m$ and $b$ in light of the data and obtain the posterior distribution for these parameters. This allows us to make predictions and quantify the uncertainty of those predictions: \begin{equation} P\left(m,b|\{y_i\}^N_{i=1},I\right) = \dfrac{P\left(\{y_i\}^N_{i=1}|m,b,I\right) \: P(m,b|I)}{P(\{y_i\}^N_{i=1}|I)} \end{equation}
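In practice this likelihood is evaluated in log form for numerical stability. A direct NumPy transcription of the expression above (the function and variable names are my own):

```python
import numpy as np

def log_likelihood(m, b, x, y, sigma_y):
    """ln L = sum_i [ -0.5*ln(2*pi*sigma_i^2) - (y_i - m*x_i - b)^2 / (2*sigma_i^2) ]"""
    resid = y - (m * x + b)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_y**2)
                  - resid**2 / (2 * sigma_y**2))

# toy check with noiseless data: the true parameters (m=2, b=1) score highest
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
sigma = np.ones_like(x)
print(log_likelihood(2.0, 1.0, x, y, sigma))
```

Maximizing this function over $m$ and $b$ recovers the frequentist fit; the Bayesian treatment instead combines it with the priors $P(m, b|I)$ to obtain the full posterior.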
In $\texttt{PyMC}$, we use a probabilistic programming framework to define bayesian models. The syntax follows the structure of a probabilistic model, where we specify the priors, likelihood, and perform inference using sampling methods. Below is an explanation of the key components and the flow of the code:
Model Definition: the model is defined within a context manager called $\texttt{Model()}$. This context ensures that all variables and computations defined inside it are associated with the same model.
with Model() as model:
    # model specifications
Priors: priors represent our beliefs or assumptions about the parameters before observing any data. In this example, three priors are defined for the model parameters: $ \sigma $, $ b $ (intercept), and $ m $ (slope).
sigma = Normal(r'$\sigma$', 0, sigma=20)
intercept = Normal(r'$b$', 0, sigma=20)
slope = Normal(r'$m$', 0, sigma=20)
Each prior is defined using a distribution. Here, a normal distribution is used for each parameter:
- $ \sigma $ is the standard deviation, with a prior of $ \mathcal{N}(0, 20) $.
- $ b $ (intercept) has a prior $ \mathcal{N}(0, 20) $.
- $ m $ (slope) also has a prior $ \mathcal{N}(0, 20) $.
These priors reflect our initial uncertainty about the values of the parameters.
Likelihood: the likelihood function defines the probability of the observed data given the model parameters. In this case, the data is assumed to come from a normal distribution, where the mean is given by the linear relationship $ \mu = b + m \cdot x $, and the standard deviation is $ \sigma $. The likelihood is defined as follows:
likelihood = Normal('y', mu=intercept + slope * x_red, sigma=sigma, observed=y_red)
Here:
- $ y $ is the observed data.
- The mean of the normal distribution is $ \mu = b + m \cdot x $.
- The observed data is $ y_{\text{red}} $.
- The standard deviation $ \sigma $ is passed as a parameter to the likelihood.
The likelihood tells us how probable the observed data is given the parameters $ b $, $ m $, and $ \sigma $.
Inference: once the priors and likelihood are defined, we use an inference method to sample from the posterior distribution of the parameters. In this case, we use the $\texttt{sample()}$ function, which draws samples from the posterior distribution using Hamiltonian Monte Carlo (HMC) with the No-U-Turn Sampler (NUTS). This sampling method explores the parameter space efficiently.
idata = sample(1000)
The $\texttt{sample()}$ function generates 1000 samples from the posterior distribution, which represent the possible values of the parameters $ b $, $ m $, and $ \sigma $ after observing the data.
Summary:
- Defining priors for the model parameters ($ b $, $ m $, $ \sigma $).
- Defining the likelihood function, which specifies how the data is generated given the parameters.
- Using a sampling algorithm (NUTS) to infer the posterior distribution of the parameters based on the observed data.
By sampling from the posterior, $\texttt{PyMC}$ provides a way to quantify uncertainty in the parameters and make predictions.
$\texttt{pm.sample}$: this function in $\texttt{PyMC}$ is used for drawing samples from the posterior distribution of the specified Bayesian model. It performs Markov Chain Monte Carlo (MCMC) sampling to approximate the posterior distribution. The key parameters include:
$\texttt{draws}$: the number of posterior samples to draw. In general, the more samples you draw, the better the approximation of the posterior distribution, but it also increases the computational time.
$\texttt{tune}$: the number of initial samples to be discarded as tuning or burn-in samples. During this phase, the sampler adapts itself to the parameter space and learns to explore the high-probability regions of the posterior distribution. These discarded samples are not included in the final results; they are only used to stabilize the sampling process.
$\texttt{target\_accept}$: the target acceptance probability for NUTS (No-U-Turn Sampler) proposals, i.e., the probability that a proposed step in the parameter space will be accepted. It controls the step size and hence the efficiency of the exploration. A common target is 0.8, meaning the sampler aims to accept about 80% of the proposed steps; higher values (e.g., 0.9 or 0.95) use smaller steps and can help when divergences appear.
$\texttt{chains}$: the number of Markov chains to run in parallel. Multiple chains are useful for checking convergence, i.e., verifying that all chains have explored the parameter space and converge to the same posterior distribution. Typically, at least two chains are run in parallel.
Part 2.1: first steps with $\texttt{emcee}$¶
As we want to compare how Bayesian inference is handled in $\texttt{pymc}$ with respect to $\texttt{emcee}$, it is instructive to build the same simple linear regression model with $\texttt{emcee}$ and run the sampler.
# log-likelihood function
def log_likelihood(theta, x, y):
    intercept, slope, sigma = theta  # sigma is now inferred
    if sigma <= 0:  # avoid negative values for sigma
        return -np.inf
    model = intercept + slope * x
    return -0.5 * np.sum(((y - model) / sigma) ** 2 + np.log(2 * np.pi * sigma**2))

# log-prior function
def log_prior(theta):
    intercept, slope, sigma = theta
    if sigma <= 0:  # sigma must be positive
        return -np.inf
    # normal priors for intercept and slope; the -log(sigma) term is a
    # scale-invariant (Jeffreys-like) prior p(sigma) ∝ 1/sigma
    return -0.5 * (intercept**2 / 20**2 + slope**2 / 20**2) - np.log(sigma)

# total log-probability function
def log_probability(theta, x, y):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, x, y)

# sampler configuration
ndim = 3  # number of parameters: intercept, slope, sigma
nwalkers = 50  # number of walkers
p0 = np.random.randn(nwalkers, ndim) * 0.1  # initial conditions near 0

# ensure sigma starts with positive values
p0[:, 2] = np.abs(p0[:, 2]) + 1  # avoid negative or zero values for sigma

# create and run the sampler
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_probability, args=(x_red, y_red))
sampler.run_mcmc(p0, 1000, progress=True)  # 1000 iterations

# retrieve results
samples = sampler.get_chain(discard=200, thin=10, flat=True)  # adjust as needed
100%|███████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 835.88it/s]
# plot traces for each parameter
fig, axes = plt.subplots(ndim, figsize=(10, 10), sharex=True)
labels = [r"$b$", r"$m$", r"$\sigma$"]
for i in range(ndim):
    axes[i].plot(samples[:, i], "k", alpha=0.3)
    axes[i].set_ylabel(labels[i])
axes[-1].set_xlabel("Step")
plt.show()
# corner plot to visualize parameter distributions
fig = corner.corner(samples, labels=labels, truths=np.median(samples, axis=0))
plt.show()
param_medians = np.median(samples, axis=0)
param_std = np.std(samples, axis=0)
labels = ["Intercept", "Slope", "Sigma"]
for i in range(ndim):
    print(f"{labels[i]}: {param_medians[i]:.3f} ± {param_std[i]:.3f}")
Intercept: 8.348 ± 17.802
Slope: 2.334 ± 0.111
Sigma: 30.551 ± 6.462
It is possible to create a model where $\sigma$ is treated as one more piece of observational data, alongside the $x$, $y$ pairs. In this case, the Bayesian model will not infer the $\sigma$ parameter but will use it as observed data (Appendix C).
If $\sigma$ is a parameter to be inferred, the model is more flexible, since it accounts for the uncertainty in $\sigma$, which leads to greater uncertainty in the estimates of the intercept and slope. When $\sigma$ is observed, the model is simplified: it is no longer necessary to infer $\sigma$, which reduces the uncertainty in the regression parameters, but you also lose the ability to model the variability of $\sigma$. This may be appropriate when you have a good estimate of $\sigma$ and do not need to infer it.
# log-likelihood function with observed sigma (sigma_y_obs)
def log_likelihood(theta, x, y, sigma_y_obs):
    intercept, slope = theta  # sigma is now observed, so not in theta
    model = intercept + slope * x
    # use the observed sigma_y_obs instead of inferred sigma
    return -0.5 * np.sum(((y - model) / sigma_y_obs) ** 2 + np.log(2 * np.pi * sigma_y_obs**2))

# log-prior function (no sigma to infer)
def log_prior(theta):
    intercept, slope = theta
    # normal priors for intercept and slope
    return -0.5 * (intercept**2 / 20**2 + slope**2 / 20**2)

# total log-probability function (using observed sigma)
def log_probability(theta, x, y, sigma_y_obs):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + log_likelihood(theta, x, y, sigma_y_obs)

# sampler configuration
ndim = 2  # number of parameters: intercept and slope (sigma is now observed)
nwalkers = 50  # number of walkers
p0 = np.random.randn(nwalkers, ndim) * 0.1  # initial conditions near 0

# create and run the sampler
sampler_simpl = emcee.EnsembleSampler(nwalkers, ndim, log_probability, args=(x_red, y_red, sigma_y_red))
sampler_simpl.run_mcmc(p0, 1000, progress=True)  # 1000 iterations

# retrieve results
samples_simpl = sampler_simpl.get_chain(discard=200, thin=10, flat=True)  # adjust as needed
100%|███████████████████████████████████████████████████| 1000/1000 [00:01<00:00, 821.08it/s]
# plot traces for each parameter
fig, axes = plt.subplots(ndim, figsize=(10, 6), sharex=True)
labels = [r"$b$", r"$m$"]  # only two parameters: sigma is observed here
for i in range(ndim):
    axes[i].plot(samples_simpl[:, i], "k", alpha=0.3)
    axes[i].set_ylabel(labels[i])
axes[-1].set_xlabel("Step")
plt.show()
# corner plot to visualize parameter distributions
fig = corner.corner(samples_simpl, labels=labels, truths=np.median(samples_simpl, axis=0))
plt.show()
param_medians_simpl = np.median(samples_simpl, axis=0)
param_std_simpl = np.std(samples_simpl, axis=0)
labels = ["Intercept", "Slope"]
for i in range(ndim):
    print(f"{labels[i]}: {param_medians_simpl[i]:.3f} ± {param_std_simpl[i]:.3f}")
Intercept: 18.171 ± 13.654
Slope: 2.329 ± 0.084
Part 2.2: what about $\texttt{pymc}$?¶
Now, should we try with $\texttt{pymc}$?
with Model() as model:  # model specification inside a with block
    #priors
    sigma = HalfNormal(r'$\sigma$', sigma=20)  # half-normal keeps the noise scale positive
    intercept = Normal(r'$b$', 0, sigma=20)
    slope = Normal(r'$m$', 0, sigma=20)
    #likelihood
    likelihood = Normal('y', mu=intercept + slope * x_red, sigma=sigma, observed=y_red)
    #inference
    #draw 1000 samples for posteriors using NUTS sampling
    idata = sample(1000)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [$\sigma$, $b$, $m$]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
plt.rcParams.update({'font.size': 14, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
#if you want to see the traces of the Markov chains for each parameter
az.plot_trace(idata, figsize=(10, 10), legend=False)
#trace plot axes
axes = plt.gcf().get_axes()
#specify chain colors
chain_colors = ['red', 'blue', 'green', 'purple']
#modify styles and colors for the chains
for ax in axes:
    lines = ax.get_lines()
    for i, line in enumerate(lines):
        line.set_color(chain_colors[i % len(chain_colors)])
        line.set_linestyle('-')
plt.legend(labels=['Chain 0', 'Chain 1', 'Chain 2', 'Chain 3'], frameon=True, bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10)
plt.show()
warnings.simplefilter("ignore", category=UserWarning)
_ = az.plot_pair(idata,
                 var_names=[r'$\sigma$', r'$b$', r'$m$'],
                 kind=['scatter', 'kde'],
                 divergences=True,
                 kde_kwargs={'fill_last': False},
                 marginals=True,
                 point_estimate='median',
                 textsize=20)
Now, to easily inspect all the inferred parameters and their summary statistics, $\texttt{arviz}$ provides the perfect tool: the summary function (Appendix D).
# statistics summary from samples obtained in inference process
print("Summary of posterior sampling:")
az.summary(idata)
Summary of posterior sampling:
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
|---|---|---|---|---|---|---|---|---|---|
| $\sigma$ | 30.130 | 5.129 | 21.484 | 40.165 | 0.117 | 0.086 | 2073.0 | 1915.0 | 1.0 |
| $b$ | 7.891 | 17.364 | -24.611 | 39.995 | 0.472 | 0.334 | 1356.0 | 1641.0 | 1.0 |
| $m$ | 2.335 | 0.108 | 2.131 | 2.531 | 0.003 | 0.002 | 1405.0 | 1931.0 | 1.0 |
As we did before with $\texttt{emcee}$, the $\texttt{pymc}$ model can also be written with $\sigma$ as observed data. The model is built as follows.
with Model() as model_simpl:  # model specification inside a with block
    #priors
    intercept = Normal(r'$b$', 0, sigma=20)
    slope = Normal(r'$m$', 0, sigma=20)
    #likelihood
    likelihood = Normal('y', mu=intercept + slope * x_red, sigma=sigma_y_red, observed=y_red)
    #inference
    #draw 1000 samples
    idata_simpl = sample(1000)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [$b$, $m$]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 3 seconds.
warnings.simplefilter("ignore", category=FutureWarning)
az.plot_trace(idata_simpl, figsize=(10, 7), legend=False)
plt.rcParams.update({'font.size': 14, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
#obtain trace axis objects
axes = plt.gcf().get_axes()
#chain colors
chain_colors = ['red', 'blue', 'green', 'purple']
#modify colors and linestyles
for ax in axes:
    lines = ax.get_lines()
    for i, line in enumerate(lines):
        line.set_color(chain_colors[i % len(chain_colors)])
        line.set_linestyle('-')
plt.legend(labels=['Chain 0', 'Chain 1', 'Chain 2', 'Chain 3'], frameon=True, bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10)
plt.show()
_ = az.plot_pair(idata_simpl,
                 var_names=[r'$b$', r'$m$'],
                 kind=['scatter', 'kde'],
                 divergences=True,
                 kde_kwargs={'fill_last': False},
                 marginals=True,
                 point_estimate='median',
                 textsize=11)
# statistics summary from samples obtained in inference process
print("Summary of posterior sampling:")
az.summary(idata_simpl)
Summary of posterior sampling:
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
|---|---|---|---|---|---|---|---|---|---|
| $b$ | 18.323 | 13.679 | -7.775 | 44.040 | 0.437 | 0.310 | 993.0 | 933.0 | 1.0 |
| $m$ | 2.330 | 0.084 | 2.171 | 2.488 | 0.003 | 0.002 | 1007.0 | 904.0 | 1.0 |
##### Inference data #####
idata.posterior['y_model'] = idata.posterior[r'$b$'] + idata.posterior[r'$m$'] * xr.DataArray(x_red)
idata_simpl.posterior['y_model'] = idata_simpl.posterior[r'$b$'] + idata_simpl.posterior[r'$m$'] * xr.DataArray(x_red)
##### test x array #####
x_array = np.linspace(40, 300, 1000)
##### Regression posterior (PyMC) #####
regression_median = idata.posterior[r'$b$'].median().values + idata.posterior[r'$m$'].median().values * x_red
regression_median_simpl = idata_simpl.posterior[r'$b$'].median().values + idata_simpl.posterior[r'$m$'].median().values * x_red
b_std = idata.posterior[r'$b$'].std().values
m_std = idata.posterior[r'$m$'].std().values
b = idata.posterior[r'$b$'].median().values
m = idata.posterior[r'$m$'].median().values
b_std_simpl = idata_simpl.posterior[r'$b$'].std().values
m_std_simpl = idata_simpl.posterior[r'$m$'].std().values
b_simpl = idata_simpl.posterior[r'$b$'].median().values
m_simpl = idata_simpl.posterior[r'$m$'].median().values
##### Regression posterior (emcee) #####
param_medians = np.median(samples, axis=0)  # parameter medians
param_std = np.std(samples, axis=0)  # parameter standard deviations
# emcee parameters: intercept (param_medians[0]), slope (param_medians[1])
intercept_emcee = param_medians[0]
slope_emcee = param_medians[1]
# linear regression from emcee
regression_emcee = intercept_emcee + slope_emcee * x_array
param_medians_simpl = np.median(samples_simpl, axis=0)  # parameter medians
param_std_simpl = np.std(samples_simpl, axis=0)  # parameter standard deviations
# emcee parameters: intercept (param_medians[0]), slope (param_medians[1])
intercept_emcee_simpl = param_medians_simpl[0]
slope_emcee_simpl = param_medians_simpl[1]
# linear regression from emcee
regression_emcee_simpl = intercept_emcee_simpl + slope_emcee_simpl * x_array
##### Create figure #####
plt.rcParams.update({'font.size': 11, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
# create subplot
_, ax = plt.subplots(figsize=(8, 6))
# custom scatter data features
ax.errorbar(x_red, y_red, yerr=sigma_y_red, fmt='o', label='Reduced data', capsize=5, markersize=5)
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
X_red, A_red, Y_red, C_red, sigma_m_red, sigma_b_red = results_OLS_red
ax.plot(x_array, X_red[1]*x_array + X_red[0], linewidth=1, label='Reduced')
ax.text(55, 500, f'$y = ({round(X_red[1],3)} \pm {round(sigma_m_red/np.sqrt(16),3)}) x + ({round(X_red[0],3)} \pm {round(sigma_b_red/np.sqrt(16),3)})$', fontsize=10, color='orange')
# PyMC
ax.text(55, 675, f'$y = ({round(float(m), 3)} \pm {round(float(m_std), 3)}) x + ({round(float(b),3)} \pm {round(float(b_std), 3)})$', fontsize=10, color='r')
ax.plot(x_red, regression_median, color='r', linestyle='-', linewidth=1, label='pymc regression')
ax.text(55, 645, '$y_{simpl}$' + f'$= ({round(float(m_simpl), 3)} \pm {round(float(m_std_simpl), 3)}) x + ({round(float(b_simpl),3)} \pm {round(float(b_std_simpl), 3)})$', fontsize=10, color='r')
ax.plot(x_red, regression_median_simpl, color='r', linestyle='--', linewidth=1, label='pymc regression (simpl.)')
# emcee
ax.plot(x_array, regression_emcee, color='blue', linestyle='-', label='emcee Regression', linewidth=1)
ax.text(55, 575, f'$y = ({round(slope_emcee, 3)} \pm {round(param_std[1], 3)}) x + ({round(intercept_emcee, 3)} \pm {round(param_std[0], 3)})$', fontsize=10, color='blue')
ax.plot(x_array, regression_emcee_simpl, color='blue', linestyle='--', label='emcee Regression (simpl.)', linewidth=1)
ax.text(55, 545, '$y_{simpl}$' + f'$= ({round(slope_emcee_simpl, 3)} \pm {round(param_std_simpl[1], 3)}) x + ({round(intercept_emcee_simpl, 3)} \pm {round(param_std_simpl[0], 3)})$', fontsize=10, color='blue')
# adjust axis and labels
ax.set_title('Bayesian model to reduced data')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.legend(frameon=True, bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(False)
ax.set_xlim(50, 225)
ax.tick_params(axis='both', direction='out')
plt.show()
Part 3: elegantly plot in Bayesian statistics¶
In this section, we are going to deal with some useful and elegant ways to plot the results provided by the inference process. Some of these well-known plot types show different aspects of the probability distributions that are handled in the Bayesian model.
Summary of ArviZ Functions:
Below is a detailed summary of several ArviZ functions, explaining what they do, what they plot, and their mathematical meaning:
1. $\texttt{az.plot\_kde}$:
What does it do?: This function generates a kernel density estimate (KDE) plot, either one-dimensional or two-dimensional. KDE is a non-parametric technique to estimate the probability density function of a random variable.
What does it plot?: It plots the marginal distribution of the parameters of interest. In 2D, it shows the interaction between two parameters, and in 1D, it shows the distribution of a single parameter.
Mathematical meaning: The kernel density estimate (KDE) is based on a set of samples $\{x_1, x_2, \dots, x_n\}$ from a distribution. For each point $x$, the density estimate is given by:
\begin{equation} \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - x_i}{h}\right) \end{equation}
where $K$ is a kernel function (commonly a normal distribution) and $h$ is the bandwidth parameter that controls the smoothness of the estimate.
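The formula above can be sketched in a few lines of numpy; the standard-normal samples here are a stand-in for posterior draws, and the bandwidth value is arbitrary:

```python
import numpy as np

def kde_1d(x_grid, samples, h):
    # f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h), with a Gaussian kernel K
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=2000)  # stand-in for posterior draws
grid = np.linspace(-4.0, 4.0, 161)
density = kde_1d(grid, samples, h=0.3)
# a valid density estimate should integrate to roughly 1 over the grid
print((density * (grid[1] - grid[0])).sum())
```

Smaller $h$ produces a noisier estimate that follows the samples closely; larger $h$ oversmooths the distribution.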
plt.rcParams.update({'font.size': 11, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
# define levels for contour plot
levels = [0.1, 0.3, 0.5, 0.7, 0.9] # you can adjust these values as needed
mean_b = np.mean(idata.posterior[r'$b$'].values)
mean_m = np.mean(idata.posterior[r'$m$'].values)
##### kernel density estimate (KDE) plot #####
az.plot_kde(idata.posterior[r'$b$'], values2=idata.posterior[r'$m$'], contour=True,
contour_kwargs={"colors":None, "cmap":plt.cm.autumn},
contourf_kwargs={"alpha":0}, figsize=(8, 6.5)) # add levels parameter to control contour levels
# obtain current axis object
ax = plt.gca()
# add the mean point to the plot
ax.scatter(mean_b, mean_m, color='red', label='Mean', zorder=5)
# custom scatter data features
ax.set_xlabel(r'$b$')
ax.set_ylabel(r'$m$')
ax.set_title('2D KDE Contour Map')
ax.tick_params(axis='both', direction='out')
plt.show()
##### kernel density estimate (KDE) plot #####
az.plot_kde(idata.posterior[r'$b$'], values2=idata.posterior[r'$m$'], contour=True,
contourf_kwargs={"alpha":1, "cmap":plt.cm.autumn},
contour_kwargs={"colors":'k'}, figsize=(8, 8)) # add levels parameter to control contour levels
# obtain current axis object
ax = plt.gca()
# add the mean point to the plot
ax.scatter(mean_b, mean_m, color='red', label='Mean', zorder=5)
# custom scatter data features
ax.set_xlabel(r'$b$')
ax.set_ylabel(r'$m$')
# acquire mappable object from contour plot
contour = ax.collections[0] # it is assumed that mappable object is in the first collection
# add color legend at 'bottom' and adjust location
cb = plt.colorbar(contour, label='Probability density', orientation='horizontal', pad=0.1)
cb.ax.xaxis.label.set_fontsize(12)
cb.ax.tick_params(axis='x', labelsize=12)
ax.set_title('2D KDE Contour Map')
ax.tick_params(axis='both', direction='out')
plt.show()
##### kernel density estimate (KDE) plot #####
az.plot_kde(idata.posterior[r'$b$'], values2=idata.posterior[r'$m$'], contour=True,
contourf_kwargs={"alpha":1, "cmap":plt.cm.viridis}, figsize=(8, 8)) # add levels parameter to control contour levels
# obtain current axis object
ax = plt.gca()
# add the mean point to the plot
ax.scatter(mean_b, mean_m, color='red', label='Mean', zorder=5)
# custom scatter data features
ax.set_xlabel(r'$b$')
ax.set_ylabel(r'$m$')
# acquire mappable object from contour plot
contour = ax.collections[0] # it is assumed that mappable object is in the first collection
# add color legend at 'bottom' and adjust location
cb = plt.colorbar(contour, label='Probability density', orientation='horizontal', pad=0.1)
cb.ax.xaxis.label.set_fontsize(12)
cb.ax.tick_params(axis='x', labelsize=12)
ax.set_title('2D KDE Contour Map')
ax.tick_params(axis='both', direction='out')
plt.show()
##### kernel density estimate (KDE) plot #####
az.plot_kde(idata.posterior[r'$b$'], values2=idata.posterior[r'$m$'], contour=False, figsize=(8, 8))
# obtain current axis object
ax = plt.gca()
# add the mean point to the plot
ax.scatter(mean_b, mean_m, color='red', label='Mean', zorder=5)
# custom scatter data features
ax.set_xlabel(r'$b$')
ax.set_ylabel(r'$m$')
# acquire mappable object from contour plot
contour = ax.collections[0] # it is assumed that mappable object is in the first collection
# add color legend at 'bottom' and adjust location
cb = plt.colorbar(contour, label='Probability density', orientation='horizontal', pad=0.1)
cb.ax.xaxis.label.set_fontsize(12)
cb.ax.tick_params(axis='x', labelsize=12)
ax.set_title('2D KDE Contour Map')
ax.tick_params(axis='both', direction='out')
plt.show()
2a. $\texttt{az.loo}$:
What does it do?: This function performs Leave-One-Out Cross-Validation (LOO) using Pareto-smoothed importance sampling (PSIS-LOO). LOO evaluates the model's performance using only one observation at a time for validation.
What does it plot?: Although there is no direct plot, the function returns a structure containing log-likelihood estimates and the LOO value. These can be used to create a custom plot showing the model's quality.
Mathematical meaning: The LOO estimate of the expected log pointwise predictive density is:
\begin{equation} \text{elpd}_{\text{LOO}} = \sum^N_{i=1} \log p(y_i \mid y_{-i}) \end{equation}
where $y_{-i}$ denotes the data with the $i$-th observation omitted, so each term is the predictive density of $y_i$ under the model fitted without it.
2b. $\texttt{az.plot\_khat}$:
What does it do?: It plots the Pareto $\hat{k}$ values, diagnostic metrics produced by PSIS-LOO for each observation. $\hat{k}$ is the estimated shape parameter of the generalized Pareto distribution fitted to the largest importance weights of each observation.
What does it plot?: It displays the $\hat{k}$ value of every observation; values above roughly 0.7 flag observations for which the PSIS-LOO approximation, and often the model fit, is unreliable.
Mathematical meaning: The $\hat{k}$ value measures how heavy-tailed the importance weights are, i.e., how strongly the leave-one-out posterior for an observation differs from the full posterior.
Firstly, we need the pointwise log-likelihood in the inference data, so we reopen the model, sample again, and extend $\texttt{idata}$ with the log-likelihood.
with model:
    idata = pm.sample(return_inferencedata=True)
    pm.compute_log_likelihood(idata)  # extends idata in place with the pointwise log-likelihood
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [$\sigma$, $b$, $m$]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4 seconds.
loo = az.loo(idata, pointwise=True)
az.plot_khat(loo.pareto_k, textsize=11)
plt.show()
3. $\texttt{az.plot\_pair}$:
What does it do?: This function generates a pair plot (scatterplot matrix) showing the relationships between all pairs of parameters in the model. It helps visualize the correlations and structure of the model samples.
What does it plot?: It shows the relationships between the variables (model parameters), using scatter plots for pairwise combinations of two variables, and histograms or density plots for individual variables.
Mathematical meaning: This plot is used to visualize the dependencies between variables, and the correlations between them. Mathematically, the relationship between two parameters $m$ and $b$ is measured using the Pearson correlation coefficient $r$:
\begin{equation} r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \end{equation}
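The coefficient can be computed by hand and checked against numpy's built-in; the anti-correlated draws below are synthetic, mimicking the typical slope-intercept degeneracy:

```python
import numpy as np

def pearson_r(x, y):
    # r = sum((x_i - xbar)(y_i - ybar)) / sqrt(sum((x_i - xbar)^2) * sum((y_i - ybar)^2))
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

rng = np.random.default_rng(1)
b_draws = rng.normal(size=4000)                         # stand-in for posterior draws of b
m_draws = -0.8 * b_draws + 0.3 * rng.normal(size=4000)  # anti-correlated, as m and b usually are
r = pearson_r(b_draws, m_draws)
print(r < 0, np.isclose(r, np.corrcoef(b_draws, m_draws)[0, 1]))  # True True
```

A strongly negative $r$ between slope and intercept is exactly the tilted ellipse seen in the pair plots above.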
warnings.simplefilter("ignore", category=FutureWarning)
_ = az.plot_pair(idata,
                 var_names=[r'$\sigma$', r'$b$', r'$m$'],
                 kind='kde',
                 marginals=False,
                 divergences=True,
                 point_estimate='mode',
                 textsize=16)
_ = az.plot_pair(idata_simpl,
                 var_names=[r'$b$', r'$m$'],
                 kind='kde',
                 marginals=True,
                 divergences=True,
                 point_estimate='median',
                 textsize=11)
4. $\texttt{az.plot\_autocorr}$:
What does it do?: This function generates an autocorrelation plot of the model samples, showing how the observations are correlated across different iterations of the Markov chain.
What does it plot?: It plots the autocorrelation of the Markov chains for each parameter of the model, allowing you to assess whether the chains have "mixed" properly.
Mathematical meaning: The autocorrelation is defined as:
\begin{equation} \rho_k = \frac{\text{Cov}(X_t, X_{t+k})}{\text{Var}(X)} \end{equation}
where $X_t$ is the chain at time $t$ and $k$ is the lag. Autocorrelation measures the dependence of the variable over time in the chain.
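A minimal numpy sketch of this estimator, applied to a synthetic AR(1) series that mimics a slowly mixing MCMC trace (the 0.9 coefficient is arbitrary):

```python
import numpy as np

def autocorr(chain, k):
    # rho_k = Cov(X_t, X_{t+k}) / Var(X), estimated from one chain
    x = chain - chain.mean()
    return 1.0 if k == 0 else (x[:-k] * x[k:]).mean() / x.var()

# AR(1) series: X_t = 0.9 * X_{t-1} + noise, so neighbouring draws are dependent
rng = np.random.default_rng(2)
chain = np.empty(5000)
chain[0] = 0.0
for t in range(1, len(chain)):
    chain[t] = 0.9 * chain[t - 1] + rng.normal()

print(autocorr(chain, 1))    # close to 0.9: strong lag-1 dependence, slow mixing
print(autocorr(chain, 200))  # close to 0: draws far apart are nearly independent
```

High autocorrelation at small lags is the reason thinning (as in `get_chain(thin=10)` above) is sometimes applied to emcee chains.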
_ = az.plot_autocorr(idata_simpl,
                     var_names=[r'$b$', r'$m$'],
                     textsize=24)
5. $\texttt{az.plot\_forest}$:
What does it do?: It creates a forest plot representing the posterior distributions of the model parameters, typically visualizing the distributions of the different chains and their central values.
What does it plot?: A forest plot showing, for each parameter, its credible interval (e.g., 94%) and a central value (such as the mean or median).
Mathematical meaning: Each parameter $\theta$ has a posterior distribution $p(\theta \mid \text{data})$; the plot summarizes the density of that distribution through its credible intervals and point estimates.
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=UserWarning)
_ = az.plot_forest(idata,
                   kind='forestplot',
                   var_names=[r'$\sigma$', r'$b$'],
                   figsize=(8, 6),
                   textsize=12)
_ = az.plot_forest(idata,
                   kind='forestplot',
                   var_names=[r'$m$'],
                   figsize=(8, 4),
                   textsize=12)
7. $\texttt{az.plot\_posterior}$:
What does it do?: It generates plots that show the posterior distribution of the model parameters. It is useful for visualizing the uncertainty about the parameters after performing Bayesian inference.
What does it plot?: It plots the posterior distribution of one or more model parameters. It can be in the form of histograms, KDE (Kernel Density Estimate), or density plots.
Mathematical meaning: The posterior distribution of a parameter $\theta$ is calculated using Bayes' theorem:
\begin{equation} p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})} \end{equation}
where $p(\theta \mid \text{data})$ is the posterior distribution, $p(\text{data} \mid \theta)$ is the likelihood, $p(\theta)$ is the prior, and $p(\text{data})$ is the evidence.
_ = az.plot_posterior(idata_simpl,
                      var_names=[r'$b$', r'$m$'],
                      hdi_prob=0.94,
                      point_estimate='median',
                      textsize=12)
8. $\texttt{az.plot\_ppc}$:
What does it do?: This function generates a posterior predictive check plot, which shows new data simulated from the posterior parameter samples of the model.
What does it plot?: It compares the real observations to the new data simulated from the posterior model. It can use different plot types, such as KDE, scatter plots, etc.
Mathematical meaning: The posterior predictive for new data $y_{\text{new}}$ is calculated as:
\begin{equation} p(y_{\text{new}} \mid \text{data}) = \int p(y_{\text{new}} \mid \theta) p(\theta \mid \text{data}) d\theta \end{equation}
where $y_{\text{new}}$ is the new data you predict and $\theta$ are the parameters sampled from the posterior (Appendix F).
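The integral above can be approximated by Monte Carlo: draw one $y_{\text{new}}$ per posterior draw of $\theta$. The numbers below are made-up stand-ins for posterior draws (in the notebook they would come from $\texttt{idata.posterior}$):

```python
import numpy as np

rng = np.random.default_rng(3)
# stand-ins for 4000 posterior draws of (b, m, sigma); illustrative values only
b_post = rng.normal(10.0, 2.0, size=4000)
m_post = rng.normal(2.3, 0.1, size=4000)
sigma_post = np.abs(rng.normal(30.0, 5.0, size=4000))

x_new = 150.0
# one predictive draw per posterior draw approximates
# p(y_new | data) = integral of p(y_new | theta) p(theta | data) dtheta
y_new = rng.normal(b_post + m_post * x_new, sigma_post)
print(y_new.mean())  # close to 10 + 2.3 * 150 = 355
```

The spread of `y_new` combines the observational noise $\sigma$ with the parameter uncertainty, which is why posterior predictive bands are wider than the noise alone.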
with model:
    posterior_predictive = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
Sampling: [y]
_ = az.plot_ppc(idata, textsize=11)
The $\texttt{pm.model\_to\_graphviz(model)}$ function in $\texttt{PyMC}$ generates a visual representation of the probabilistic model in the form of a directed graph, using $\texttt{Graphviz}$. It takes as input a $\texttt{PyMC}$ model ($\texttt{pymc.model()}$) and generates a dependency diagram between the random variables of the model.
pm.model_to_graphviz(model)
Only $\texttt{plain}$ format is supported in $\texttt{model\_to\_graphviz}$
Considerations that should be noted for the model:
The Rhat statistic, also known as the Gelman-Rubin diagnostic, measures convergence of the Markov chains: values close to 1 indicate convergence, while values greater than 1.01 may point to sampling problems.
The effective sample size (ESS) of each parameter should be greater than about 100 for the posterior summaries to be reliable.
Part 4: and what happens to those uncomfortable or transcendent outliers?¶
(Extracted from Hogg et al. 2010)
The standard linear fitting method is very sensitive to outliers, that is, points that are substantially farther from the linear relation than expected—or not on the relation at all—because of unmodeled experimental uncertainty or unmodeled but rare sources of noise.
If we want to explicitly and objectively reject bad data points, we must add to the problem a set of $N$ binary integers $q_i$, one per data point, each of which is unity if the $i$-th data point is good, and zero if the $i$-th data point is bad. In addition, to construct an objective function, one needs a parameter $P_b$, the prior probability that any individual data point is bad, and parameters $(Y_b, V_b)$, the mean and variance of the distribution of bad points (in $y$). We need these latter parameters because, as we have emphasized above, our inference comes from evaluating a probabilistic generative model for the data, and that model must generate the bad points as well as the good points.
However, this adds $N+3$ extra parameters (all the $\{q_i\}^{N}_{i=1}$, plus $P_b$, $Y_b$, $V_b$); fortunately, the $\{q_i\}$ can be marginalized out, leaving just $P_b$, $Y_b$, $V_b$ (Appendix G).
In this way, the mixture model we use is: \begin{equation} \mathcal{L} = P\left(\{y_i\}^N_{i=1}|m,\,b,\,P_b,\,Y_b,\,V_b,\,I\right) \end{equation} where $P_b$ is the probability between 0 and 1 for a data point to be bad, $(Y_b,\,V_b)$ are the mean and variance of the bad points, and $I$ refers to all prior knowledge, so the likelihood is set as: \begin{equation} \mathcal{L} = \prod^N_{i=1} \left[\left(1-P_b\right)\,p_{\text{fg}}\left(y_i|m,\,b,\,I\right) + P_b\:p_{\text{bg}}\left(y_i|Y_b,\,V_b,\,I\right)\right] = \end{equation} \begin{equation} = \prod^N_{i=1} \left[\frac{1 - P_b}{\sqrt{2\pi\sigma^2_i}} \exp{\left(-\frac{(y_i - mx_i - b)^2}{2\sigma^2_i}\right)} + \frac{P_b}{\sqrt{2\pi (V_b + \sigma^2_i)}} \exp{\left(-\frac{(y_i - Y_b)^2}{2(V_b + \sigma^2_i)}\right)}\right] \end{equation} where $p_{\text{fg}}$ is the model for the good points on the straight line (foreground) and $p_{\text{bg}}$ models the outliers (background, bad points).
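The marginalized likelihood above is straightforward to evaluate in numpy; the toy data and parameter values below are illustrative only:

```python
import numpy as np

def mixture_loglike(m, b, P_b, Y_b, V_b, x, y, sigma_y):
    # foreground: straight line with per-point Gaussian noise sigma_y
    fg = (1 - P_b) / np.sqrt(2 * np.pi * sigma_y**2) \
         * np.exp(-(y - m * x - b) ** 2 / (2 * sigma_y**2))
    # background: broad Gaussian with mean Y_b and variance V_b + sigma_y^2
    bg = P_b / np.sqrt(2 * np.pi * (V_b + sigma_y**2)) \
         * np.exp(-(y - Y_b) ** 2 / (2 * (V_b + sigma_y**2)))
    return np.sum(np.log(fg + bg))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0
y[-1] = 100.0  # one gross outlier
sigma_y = np.ones_like(x)
# admitting some outlier probability beats pretending every point is good
print(mixture_loglike(2.0, 1.0, 0.2, 100.0, 25.0, x, y, sigma_y) >
      mixture_loglike(2.0, 1.0, 1e-9, 100.0, 25.0, x, y, sigma_y))  # True
```

This is the same likelihood the PyMC `Mixture` below encodes, with the background component absorbing points the straight line cannot explain.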
with Model() as model_out:
    #priors
    intercept = Uniform(r'$b$', 0, 75)
    slope = Uniform(r'$m$', 0, 10)
    #priors for outliers
    P_b = Uniform(r'$P_b$', 0.05, 0.75)
    Y_b = Uniform(r'$Y_b$', 100, 700)
    V_b = HalfNormal(r'$V_b$', 40)
    #components
    like = Normal.dist(mu=slope * x_column + intercept, sigma=sigma_y_column)
    outlier = Normal.dist(mu=Y_b, sigma=pm.math.sqrt(V_b + sigma_y_column**2))  # pm.math.sqrt operates on PyMC tensors
    #mixture distribution
    y_likelihood = Mixture('y_likelihood', w=[1 - P_b, P_b], comp_dists=[like, outlier], observed=y_column)
    #inference
    trace = sample(1000, tune=1000, target_accept=0.95)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [$b$, $m$, $P_b$, $Y_b$, $V_b$]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.
az.plot_trace(trace, figsize=(10, 20), legend=False)
plt.rcParams.update({'font.size': 14, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
axes = plt.gcf().get_axes()
chain_colors = ['red', 'blue', 'green', 'purple']
for ax in axes:
    lines = ax.get_lines()
    for i, line in enumerate(lines):
        line.set_color(chain_colors[i % len(chain_colors)])
        line.set_linestyle('-')
plt.legend(labels=['Chain 0', 'Chain 1', 'Chain 2', 'Chain 3'], frameon=True, bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=10)
plt.show()
_ = az.plot_pair(trace,
                 var_names=[r'$b$', r'$m$', r'$P_b$', r'$Y_b$', r'$V_b$'],
                 kind=['scatter', 'kde'],
                 divergences=True,
                 kde_kwargs={'fill_last': False},
                 marginals=True,
                 point_estimate='mean',
                 textsize=32)
warnings.simplefilter("ignore", category=FutureWarning)
plt.rcParams.update({'font.size': 11, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
# create subplot
_, ax = plt.subplots(figsize=(8, 6))
# custom scatter data features
ax.errorbar(x_column, y_column, yerr=sigma_y_column, fmt='o', label='Outliers?', capsize=5, markersize=5)
ax.errorbar(x_red, y_red, yerr=sigma_y_red, fmt='o', label='Reduced data', capsize=5, markersize=5)
x_array = np.linspace(40, 300, 1000)
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
# add mean regression line
regression_median = idata.posterior[r'$b$'].median().values + idata.posterior[r'$m$'].median().values * x_red
b_std = idata.posterior[r'$b$'].std().values
m_std = idata.posterior[r'$m$'].std().values
b = idata.posterior[r'$b$'].median().values
m = idata.posterior[r'$m$'].median().values
ax.text(40, 500, rf'$y = ({round(float(m), 3)} \pm {round(float(m_std), 3)}) x + ({round(float(b),3)} \pm {round(float(b_std), 3)})$', fontsize=10, color='r')
ax.plot(x_red, regression_median, color='r', linestyle='-', linewidth=1, label='pymc regression (reduced data)')
X, A, Y, C, sigma_m, sigma_b = results_OLS
ax.plot(x_array, X[1]*x_array + X[0], linewidth=1, color='grey', label='algebraic regression (all data)')
ax.text(40, 475, rf'$y = ({round(X[1],2)} \pm {round(sigma_m/np.sqrt(20),2)}) x + ({round(X[0],2)} \pm {round(sigma_b/np.sqrt(20),2)})$', fontsize=10, color='grey')
regression_median_out = trace.posterior[r'$b$'].median().values + trace.posterior[r'$m$'].median().values * x_column
b_std_out = trace.posterior[r'$b$'].std().values
m_std_out = trace.posterior[r'$m$'].std().values
b_out = trace.posterior[r'$b$'].median().values
m_out = trace.posterior[r'$m$'].median().values
ax.text(40, 450, rf'$y = ({round(float(m_out), 3)} \pm {round(float(m_std_out), 3)}) x + ({round(float(b_out),3)} \pm {round(float(b_std_out), 3)})$', fontsize=10, color='b')
ax.plot(x_column, regression_median_out, color='b', linestyle='-', linewidth=1, label='pymc regression (mixture)')
_ = ax.legend(frameon=True, bbox_to_anchor=(1.05, 1), loc='upper left')
Part 5: what if we have not only $\sigma_y$ but also $\sigma_x$?¶
What happens if we are given both $\sigma_x$ and $\sigma_y$? We can create a plot where the data $x$ and $y$, together with their uncertainties $\sigma_x$, $\sigma_y$ and covariance $\sigma_{xy} = \rho_{xy}\sigma_x\sigma_y$, are represented by ellipses instead of error bars. Each ellipse is the 1-sigma ellipse centered at the point $(x, y)$: one-sigma uncertainty in the x-direction, one-sigma uncertainty in the y-direction, and a correlation coefficient $\rho_{xy}$. The aspect ratio of the plot, denoted $\texttt{aspect\_ratio}$, is defined as $$\frac{\Delta y/\text{vertical plot length}}{\Delta x/\text{horizontal plot length}}$$ For each point $(x, y)$, the code should compute the covariance, build the covariance matrix, obtain the lengths of the ellipse axes (the square roots of the eigenvalues of the covariance matrix), and determine the counterclockwise rotation angle of the ellipse as: $$\theta = \frac{1}{2}\tan^{-1}\left(\frac{1}{\text{aspect\_ratio}}\cdot\frac{2\sigma_{xy}}{\sigma_x^2 - \sigma_y^2}\right)$$
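The $\texttt{GeometricRegression}$ class used below is defined earlier in the notebook; a standalone sketch of the geometry its $\texttt{ellipse\_axes}$ method is assumed to implement (using $\texttt{arctan2}$ so the angle lands in the correct quadrant) could look like this:

```python
import numpy as np

def ellipse_axes(sigma_x, sigma_y, rho_xy, aspect_ratio=1.0):
    """Sketch of the per-point ellipse geometry described above."""
    sigma_xy = rho_xy * sigma_x * sigma_y
    S = np.array([[sigma_x**2, sigma_xy],
                  [sigma_xy, sigma_y**2]])
    # axis lengths: square roots of the eigenvalues of the covariance matrix
    eigvals = np.linalg.eigvalsh(S)          # ascending order
    minor, major = np.sqrt(eigvals)
    # counterclockwise rotation angle, corrected for the plot aspect ratio
    theta = 0.5 * np.arctan2(2 * sigma_xy / aspect_ratio, sigma_x**2 - sigma_y**2)
    return major, minor, theta

# first row of the data table: sigma_x = 9, sigma_y = 61, rho_xy = -0.84
major, minor, theta = ellipse_axes(9.0, 61.0, -0.84)
print(major, minor, np.degrees(theta))
```

For uncorrelated uncertainties ($\rho_{xy}=0$) the axes reduce to $\sigma_x$ and $\sigma_y$ and the rotation angle is zero.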
x = table_data[r'$x$']
y = table_data[r'$y$']
sigma_x = table_data[r'$\sigma_x$']
sigma_y = table_data[r'$\sigma_y$']
rho_xy = table_data[r'$\rho_{xy}$']
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
x_array = np.linspace(40, 300, 1000)
#plot points with ellipses
for i in range(len(x)):
    geom_reg = GeometricRegression(sigma_x[i], sigma_y[i], rho_xy[i], 1)
    major_axis, minor_axis, theta = geom_reg.ellipse_axes()
    #create ellipse
    ellipse = Ellipse(xy=(x[i], y[i]), width=major_axis, height=minor_axis, angle=np.degrees(theta), edgecolor='red', facecolor='none')
    #add to plot
    ax.add_patch(ellipse)
ax.scatter(x, y, label='Data')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.grid(False)
ax.set_title(r'Bayesian model to reduced data $(\sigma_x,\,\sigma_y)$')
ax.tick_params(axis='both', direction='out')
_ = ax.legend(frameon=True)
With a covariance matrix for the uncertainties on both axes given by: \begin{equation} \textbf{S}_i = \begin{pmatrix} \sigma^2_{x_i} & \sigma_{x_i y_i}\\ \sigma_{x_i y_i} & \sigma^2_{y_i}\\ \end{pmatrix} \end{equation} and a Gaussian probability distribution for obtaining a point at $(x_i,\,y_i)$: $$P(x_i,\,y_i|\textbf{S}_i,\,x,\,y) = \frac{1}{2\pi\sqrt{{\rm det}(\textbf{S}_i)}} \exp{\left(-\frac{1}{2}(\textbf{Z}_i - \textbf{Z})' \textbf{S}^{-1}_i (\textbf{Z}_i - \textbf{Z})\right)}$$ where: \begin{equation} \textbf{Z} = \begin{pmatrix} x \\ y \\ \end{pmatrix} ;\: \textbf{Z}_i = \begin{pmatrix} x_i \\ y_i \\ \end{pmatrix} \end{equation} we project the two-dimensional uncertainties onto the line, whose slope $m$ defines the unit normal vector: \begin{equation} \textbf{v} = \frac{1}{\sqrt{1 + m^2}} \begin{pmatrix} - m \\ 1 \\ \end{pmatrix} = \begin{pmatrix} - \sin{\theta} \\ \cos{\theta} \\ \end{pmatrix} \end{equation} with $\theta$ the angle between the line and the x-axis. The orthogonal displacement $\Delta_i$ of each point is then: $$\Delta_i = \textbf{v}' \textbf{Z}_i - b\cos{\theta}$$ and the orthogonal variance $\Sigma^2_i$: $$\Sigma^2_i = \textbf{v}' \textbf{S}_i \textbf{v}$$ resulting in the log-likelihood: $$\log{\mathcal{L}} = K - \sum^N_{i=1}\frac{\Delta^2_i}{2\Sigma^2_i}$$
Then, as we did before, we create a mixture model for outliers: \begin{equation} \log \mathcal{L} = P_b \log{\mathcal{L}_{\text{out}}} + (1-P_b) \log{\mathcal{L}_{\text{in}}} \end{equation} which translates into the following model (Appendix H):
x_data = table_data[r'$x$']
y_data = table_data[r'$y$']
sigma_x_data = table_data[r'$\sigma_x$']
sigma_y_data = table_data[r'$\sigma_y$']
rho_xy_data = table_data[r'$\rho_{xy}$']
# define a custom log-likelihood function for inliers
def log_likelihood_inliers(theta, b_t, v, x_data, y_data, sigma_x_data, sigma_y_data, rho_xy_data):
    log_likelihood_sum = 0
    for i in range(len(x_data)):
        x_i = x_data[i]
        y_i = y_data[i]
        sigma_x = sigma_x_data[i]
        sigma_y = sigma_y_data[i]
        rho_xy = rho_xy_data[i]
        # per-point covariance matrix S_i
        S_i = np.array([[sigma_x**2, rho_xy * sigma_x * sigma_y],
                        [rho_xy * sigma_x * sigma_y, sigma_y**2]])
        Z = np.array([x_i, y_i])
        # orthogonal displacement Delta_i and orthogonal variance Sigma2_i
        delta_i = np.dot(v, Z) - b_t
        Sigma2_i = np.dot(v, np.dot(S_i, v))
        log_likelihood_i = -0.5 * delta_i**2 / Sigma2_i
        log_likelihood_sum += log_likelihood_i
    return log_likelihood_sum
# define a custom log-likelihood function for outliers
def log_likelihood_outliers(X_b, Y_b, V_b, v, x_data, y_data, sigma_x_data, sigma_y_data, rho_xy_data):
    log_likelihood_sum = 0
    for i in range(len(x_data)):
        Z = np.array([x_data[i], y_data[i]])
        delta_i = np.dot(v, Z - np.array([X_b, Y_b]))
        Sigma2_i = 2 * (np.dot(v, np.dot(np.diag([sigma_x_data[i]**2, sigma_y_data[i]**2]), v)) + V_b)
        log_likelihood_i = -0.5 * delta_i**2 / Sigma2_i
        log_likelihood_sum += log_likelihood_i
    return log_likelihood_sum
with pm.Model() as model_out_G:
    # parameters for the inlier likelihood
    theta = pm.Uniform(r'$\theta$', lower=0.1, upper=2 * np.pi)
    b_t = pm.Normal(r'$b_t$', mu=0, sigma=20)
    v = [-np.sin(theta), np.cos(theta)]
    # parameters for the outlier likelihood
    P_b = pm.Beta(r'$P_b$', alpha=2, beta=5)
    X_b = pm.Normal(r'$X_b$', mu=150, sigma=20)  # mean of the outliers in x
    Y_b = pm.Normal(r'$Y_b$', mu=150, sigma=20)  # mean of the outliers in y
    V_b = pm.HalfNormal(r'$V_b$', sigma=20)  # variance of the outliers
    # likelihood for inliers
    inlier_likelihood = log_likelihood_inliers(theta, b_t, v, x_data, y_data, sigma_x_data, sigma_y_data, rho_xy_data)
    # likelihood for outliers
    outlier_likelihood = log_likelihood_outliers(X_b, Y_b, V_b, v, x_data, y_data, sigma_x_data, sigma_y_data, rho_xy_data)
    # define the log-likelihood (without the constant term K)
    log_likelihood = P_b * outlier_likelihood + (1 - P_b) * inlier_likelihood
    # tell PyMC that the likelihood is proportional to the exponential of the log-likelihood
    likelihood = pm.Potential('likelihood', log_likelihood)
    # sampling
    trace_out_G = pm.sample(2000, tune=1000, target_accept=0.95)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [$\theta$, $b_t$, $P_b$, $X_b$, $Y_b$, $V_b$]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 10 seconds.
summary_table = az.summary(trace_out_G)
summary_table
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
|---|---|---|---|---|---|---|---|---|---|
| $b_t$ | -16.139 | 14.804 | -43.980 | 10.777 | 0.203 | 0.144 | 5309.0 | 5565.0 | 1.0 |
| $X_b$ | 73.334 | 11.218 | 52.208 | 94.332 | 0.159 | 0.113 | 4992.0 | 5239.0 | 1.0 |
| $Y_b$ | 187.864 | 19.330 | 151.839 | 223.814 | 0.249 | 0.177 | 6064.0 | 5281.0 | 1.0 |
| $\theta$ | 4.256 | 0.035 | 4.189 | 4.322 | 0.001 | 0.000 | 4561.0 | 4586.0 | 1.0 |
| $P_b$ | 0.969 | 0.015 | 0.943 | 0.993 | 0.000 | 0.000 | 6378.0 | 5342.0 | 1.0 |
| $V_b$ | 76.882 | 15.537 | 47.577 | 106.015 | 0.180 | 0.127 | 7392.0 | 4174.0 | 1.0 |
_ = az.plot_pair(trace_out_G,
var_names=[r'$P_b$', r'$X_b$', r'$Y_b$', r'$V_b$', r'$\theta$', r'$b_t$'],
kind='kde',
divergences=True,
kde_kwargs={'fill_last': False},
marginals=True,
point_estimate='median',
textsize=34)
b_t_mean = summary_table['mean'][r'$b_t$']
theta_mean = summary_table['mean'][r'$\theta$']
b_t_sd = summary_table['sd'][r'$b_t$']
theta_sd = summary_table['sd'][r'$\theta$']
sigma_m = theta_sd / np.cos(theta_mean)**2
sigma_b = b_t_sd / np.cos(theta_mean)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
x_array = np.linspace(40, 300, 1000)
#plot points with ellipses
for i in range(len(x)):
    geom_reg = GeometricRegression(sigma_x[i], sigma_y[i], rho_xy[i], 1)
    major_axis, minor_axis, theta = geom_reg.ellipse_axes()
    #create ellipse
    ellipse = Ellipse(xy=(x[i], y[i]), width=major_axis, height=minor_axis, angle=np.degrees(theta), edgecolor='red', facecolor='none')
    #add to plot
    ax.add_patch(ellipse)
ax.scatter(x_data, y_data, color='#1f77b4', label='Data')
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
ax.plot(x_array, np.tan(theta_mean)*x_array + b_t_mean/np.cos(theta_mean), linewidth=1, label='Geometric regression 2D uncertainties')
ax.grid(False)
ax.tick_params(axis='both', direction='out')
ax.text(35, 450, rf'$ y = ({round(np.tan(theta_mean),3)} \pm {round(sigma_m, 3)})x + ({round(b_t_mean/np.cos(theta_mean),3)} \pm {round(sigma_b, 3)})$', fontsize=10, color='#1f77b4')
_ = ax.legend(frameon=True)
Part 6: M-dwarfs activity (x-rays) vs rotation (Pizzolato et al. 2003)¶
Finally, one more Bayesian-inference fit is presented. In this case, we seek the X-ray activity vs. rotation relation of M-dwarf stars.
The most interesting part of this section is that we fit two different ensembles drawn from the same data: one corresponds to a saturation zone fitted by a constant value, while the other is expected to follow a linearly decreasing regression in the rotational period. We thus have data $x_{\text{data}} = \log{(\mathcal{P})}$ and $y_{\text{data}} = \log{(\mathcal{L}_X/\mathcal{L}_{bol})}$ with the following distribution in the $xy$ plane: part of the data is scattered with Gaussian dispersion around the horizontal line $y=\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$, between $\min(x_{\text{data}})$ and a value $\log{(\mathcal{P}_{sat})}$, in the saturation region where $x_{\text{data}} < \log{(\mathcal{P}_{sat})}$. Then, for $x_{\text{data}} > \log{(\mathcal{P}_{sat})}$, the data follow a linear regression, with a few outliers shifted above the bulk of the data:
$$\log{\left(\frac{\mathcal{L}_X}{\mathcal{L}_{bol}}\right)} = m\cdot \log{(\mathcal{P})} + C$$
As priors, it is known, and clearly visible in the plot, that $\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$ should be close to -3, $\log{(\mathcal{P}_{sat})}$ lies between 0.7 and 1.2, $m$ is around -1.4, and $C$ around -0.009. Furthermore, it is crucial that the fitted regression be continuous with the horizontal line, i.e., the complete model must connect the horizontal line $y=\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$ with the linear regression at $\log{(\mathcal{P}_{sat})}$, such that $\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})} = \log{(\mathcal{P}_{sat})} \cdot m + C$.
Therefore, a $\texttt{PyMC}$ model for hierarchical Bayesian inference first draws $\log{(\mathcal{P}_{sat})}$ and $\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$ from priors between the a-priori limits and around the a-priori expected value, respectively. Given these values, the parameters $m$ and $C$ are fitted to the data to the right of $\log{(\mathcal{P}_{sat})}$, so that the inference model obtains $m$, $C$, $\sigma_{\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}}$ (dispersion of the data in $y$), $V_{b}$ (variance of the outliers), $\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}$ (mean in $y$ of the outliers), and $P_{b}$ (probability of being an outlier).
The mixed model of two likelihoods will have a normal distribution for the inlier data on the linear regression $m \cdot x + C$ with $\sigma_{\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}}$, and another for the outlier data with $\mu=\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}$ and $\sigma=\sqrt{\sigma_{\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}}^2 + V_{b}}$. The weights of this mixed model will be the probabilities of $(1-P_{b})$ for the inliers and $P_{b}$ for the outliers.
data = pd.read_csv('carmencita.104_6.2_P_d_M.csv')
data = data.replace(np.nan, 0)
plotter = LuminosityPeriodPlotter(data)  # 'data' is the DataFrame with the necessary columns
fig, log_data_P_d, log_L_x_L_Lsol_array, n_array_new, n_array_max_new = plotter.plot() # Generate the plot
# Display the plot
plt.show()
w/ periods: 323 w/ luminosities and periods: 220 w/ spectral type: 323
The $\texttt{pm.math.switch(cond, ift, iff)}$ function in $\texttt{PyMC}$ is a function that acts as a conditional operator. It evaluates a boolean condition (cond) and returns ift if the condition is true ($\texttt{True}$) and iff if the condition is false ($\texttt{False}$). Essentially, it performs a selection between two values based on the condition's value.
In the current code, $\texttt{pm.math.switch()}$ is used to assign different models and parameters depending on whether $\texttt{x\_data}$ is to the right or to the left of $\texttt{log\_P\_sat}$.
The variable $\texttt{idx\_saturation}$ takes the value 0 for points where $\texttt{x\_data}$ is greater than $\texttt{log\_P\_sat}$ and 1 otherwise. In the subsequent lines, $\texttt{idx\_saturation}$ is used to select the appropriate model and parameters for each group of data.
It's a way to model different behavior based on a condition in your data. In this case, it is used to model saturation to the left and linear regression to the right of $\texttt{log\_P\_sat}$.
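For intuition, $\texttt{np.where}$ behaves like a NumPy analogue of $\texttt{pm.math.switch}$ on concrete arrays (toy values here, not the notebook's data):

```python
import numpy as np

# np.where(cond, ift, iff) picks ift element-wise where cond is True, iff otherwise,
# just as pm.math.switch does symbolically inside a PyMC model
x_data = np.array([0.2, 0.8, 1.3, 2.0])
log_P_sat = 1.0

idx_saturation = np.where(x_data > log_P_sat, 0, 1)
print(idx_saturation)  # [1 1 0 0] -> 1 in the saturation regime, 0 on the line
```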
x_data = log_data_P_d
y_data = log_L_x_L_Lsol_array
#define model
with pm.Model() as model_saturation:
    #priors
    log_P_sat = pm.Normal(r'$\log{(\mathcal{P}_{sat})}$', mu=0.5, sigma=0.1)
    m = pm.Normal(r'$m$', mu=-1.4, sigma=1)
    C = pm.Normal(r'$C$', mu=-0.0075, sigma=0.1)
    sigma_log_L = pm.HalfNormal(r'$\sigma_{\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}}$', sigma=1)
    #constrain log_P_sat * m + C to be equal to log_L_sat
    log_L_sat_constrained = pm.Deterministic(r'$\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$', log_P_sat * m + C)
    V_b = pm.HalfNormal(r'$V_b$', sigma=1)
    log_L_b = pm.Normal(r'$\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}$', mu=-3.5, sigma=1)
    P_b = pm.Beta(r'$P_b$', alpha=2, beta=5)
    #divide data into two ensembles (1 in the saturation regime, 0 otherwise)
    idx_saturation = pm.math.switch(x_data > log_P_sat, 0, 1)
    #right of log_P_sat
    mu_right = pm.math.switch(idx_saturation, m * x_data + C, log_L_b)
    #mixture term for inliers and outliers
    inlier_likelihood = pm.Normal.dist(mu=mu_right, sigma=sigma_log_L)
    outlier_likelihood = pm.Normal.dist(mu=log_L_b, sigma=pm.math.sqrt(sigma_log_L**2 + V_b))
    likelihood_right = pm.Mixture('likelihood_right', w=[1 - P_b, P_b], comp_dists=[inlier_likelihood, outlier_likelihood], observed=y_data)
    #left of log_P_sat: saturation regime
    mu_saturation = pm.math.switch(x_data <= log_P_sat, log_L_sat_constrained, m * x_data + C)
    likelihood_saturation = pm.Normal('likelihood_saturation', mu=mu_saturation, sigma=sigma_log_L, observed=y_data)
    #sampling
    trace_saturation = pm.sample(1000, tune=1000, step=[pm.NUTS(target_accept=0.9, max_treedepth=40)], chains=4)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [$\log{(\mathcal{P}_{sat})}$, $m$, $C$, $\sigma_{\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}}$, $V_b$, $\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}$, $P_b$]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.
summary_table = pm.summary(trace_saturation)
summary_table
| mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
|---|---|---|---|---|---|---|---|---|---|
| $\log{(\mathcal{P}_{sat})}$ | 1.077 | 0.031 | 1.020 | 1.135 | 0.001 | 0.000 | 3715.0 | 2860.0 | 1.0 |
| $m$ | -2.585 | 0.076 | -2.731 | -2.439 | 0.001 | 0.001 | 3164.0 | 2944.0 | 1.0 |
| $C$ | -0.282 | 0.101 | -0.483 | -0.102 | 0.002 | 0.001 | 2842.0 | 2607.0 | 1.0 |
| $\log{(\mathcal{L}_{b}/\mathcal{L}_{bol})}$ | -3.425 | 0.046 | -3.517 | -3.342 | 0.001 | 0.001 | 3579.0 | 2485.0 | 1.0 |
| $\sigma_{\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}}$ | 0.536 | 0.027 | 0.484 | 0.585 | 0.001 | 0.000 | 2602.0 | 2405.0 | 1.0 |
| $V_b$ | 0.175 | 0.053 | 0.082 | 0.281 | 0.001 | 0.001 | 3142.0 | 1864.0 | 1.0 |
| $P_b$ | 0.013 | 0.009 | 0.001 | 0.030 | 0.000 | 0.000 | 3380.0 | 2028.0 | 1.0 |
| $\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$ | -3.067 | 0.045 | -3.150 | -2.980 | 0.001 | 0.000 | 4126.0 | 3489.0 | 1.0 |
fig, ax = plt.subplots(1, 1, figsize = (10, 8))
#an example colormap
colormap = plt.cm.hot
colormap = colormap.reversed()
#normalize using the min and max values in a
normalize = plt.Normalize(vmin=min(n_array_new), vmax=n_array_max_new)
#scatter plot with the parameters defined above
im = ax.scatter(log_data_P_d, log_L_x_L_Lsol_array, c=n_array_new, s=20, cmap=colormap, marker='o', zorder=1)
divider = make_axes_locatable(ax)
cax = divider.append_axes('top', size='5%', pad=0.05)
fig.colorbar(im, cax=cax, orientation='horizontal').ax.set_xlabel('$n$, M$n$V')
cax.xaxis.set_ticks_position('top')
cax.xaxis.set_label_position('top')
ax.set_ylabel(r'$\log{(\mathcal{L}_x/\mathcal{L}_{bol})}$')
ax.get_xaxis().set_major_formatter(matplotlib.ticker.ScalarFormatter())
ax.set_xlabel(r'$\log{(\mathcal{P}\, [d])}$')
P_array_post = np.linspace(summary_table['mean'][r'$\log{(\mathcal{P}_{sat})}$'], max(log_data_P_d), 100)
ax.plot(P_array_post, summary_table['mean'][r'$m$']*P_array_post + summary_table['mean'][r'$C$'], linewidth=1)
P_array_sat = np.linspace(min(log_data_P_d), summary_table['mean'][r'$\log{(\mathcal{P}_{sat})}$'], 100)
_ = ax.plot(P_array_sat, 0*P_array_sat + summary_table['mean'][r'$\log{(\mathcal{L}_{sat}/\mathcal{L}_{bol})}$'], linewidth=1)
warnings.simplefilter("ignore", category=FutureWarning)
_ = az.plot_pair(trace_saturation,
var_names=[r'$P_b$', r'$V_b$', r'$\log{(\mathcal{P}_{sat})}$', r'$m$', r'$C$'],
kind='kde', # Kernel Density Estimate (KDE)
divergences=True,
kde_kwargs={'fill_last': False},
marginals=True,
point_estimate='median',
textsize=34)
# Create the model with the provided x_data and y_data
x_data = log_data_P_d # Replace with your actual data
y_data = log_L_x_L_Lsol_array # Replace with your actual data
saturation_model = SaturationModel(x_data, y_data)
trace_saturation, model_saturation = saturation_model.build_model()
# Visualize the model
pm.model_to_graphviz(model_saturation)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [log_P_sat, m, C, sigma_log_L, V_b, log_L_b, P_b]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 14 seconds.
References¶
Hogg, D. W.; Bovy, J.; Lang, D.: Data analysis recipes: Fitting a model to data, arXiv:1008.4686 [astro-ph.IM], 2010.
Pizzolato, N.; Maggio, A.; Micela, G.; Sciortino, S.; Ventura, P.: The stellar activity-rotation relationship revisited, A&A 397, 147-157, 2003.
Appendix¶
Appendix A: algebraic data fitting method and covariance matrix demonstration¶
In the Ordinary Least Squares (OLS) method, we assume that the errors $\varepsilon$ are independent and identically distributed (i.i.d.) with constant variance, meaning each data point contributes equally to the fit. To obtain the best-fitting line, we minimize the sum of the squared residuals:
\begin{equation} \chi^2 = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \end{equation}
The solution to the OLS problem can be found by solving the following equation:
\begin{equation} A = (X^T X)^{-1} X^T Y \end{equation}
where $X^T$ is the transpose of the design matrix $X$, and $X^T X$ is the matrix that determines the sensitivity of the model to each of the parameters.
The covariance of the parameters in OLS is calculated by:
\begin{equation} \text{Cov}(\beta) = \hat{\sigma}^2 (X^T X)^{-1} \end{equation}
where $\hat{\sigma}^2$ is the residual variance, estimated as:
\begin{equation} \hat{\sigma}^2 = \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N - p} \end{equation}
where $p$ is the number of parameters (in this case, 2: the slope $m$ and the intercept $b$).
The variance of the individual parameters (slope and intercept) is:
\begin{equation} \text{Var}(\hat{m}) = \frac{\hat{\sigma}^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \end{equation} \begin{equation} \text{Var}(\hat{b}) = \frac{\hat{\sigma}^2 \sum_{i=1}^{N} x_i^2}{N \sum_{i=1}^{N} (x_i - \bar{x})^2} \end{equation}
The covariance matrix for the parameters, derived step by step below, is given by:
\begin{equation} \text{Cov} (\hat{b}, \hat{m}) = \hat{\sigma}^2 (X^T X)^{-1} = \hat{\sigma}^2 \begin{bmatrix} \dfrac{\sum_{i=1}^{N} x_i^2}{N \sum_{i=1}^{N} (x_i - \bar{x})^2} & - \dfrac{\bar{x}}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \\ - \dfrac{\bar{x}}{\sum_{i=1}^{N} (x_i - \bar{x})^2} & \dfrac{1}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \end{bmatrix} \end{equation}
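The OLS solution and its covariance matrix can be verified numerically. This sketch uses made-up synthetic data and cross-checks the closed form against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)  # true m = 2, b = 1

# design matrix: column of ones (intercept b), column of x values (slope m)
X = np.vstack([np.ones_like(x), x]).T
A = np.linalg.inv(X.T @ X) @ X.T @ y        # A = (X^T X)^{-1} X^T Y
b_hat, m_hat = A

# residual variance with p = 2 fitted parameters
resid = y - X @ A
sigma2_hat = resid @ resid / (x.size - 2)
cov = sigma2_hat * np.linalg.inv(X.T @ X)   # Cov(b_hat, m_hat)

# cross-check against numpy's least-squares solver
A_np, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(A, A_np))                 # True
```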
In the Weighted Least Squares (WLS) method, we relax the assumption made in OLS that all observations have the same variance. Instead, we allow each observation to have its own variance $\sigma_{y_i}^2$. To account for this, we introduce a weight for each data point, which is inversely proportional to the variance of the corresponding error. Specifically, the weights are given by:
\begin{equation} w_i = \frac{1}{\sigma_{y_i}^2} \end{equation}
This means that data points with smaller uncertainties (smaller errors) receive more weight in the fitting process.
The weighted least squares fitting method involves minimizing a chi-squared expression that takes these weights into account (see Appendix B):
\begin{equation} \chi^2 = \sum_{i=1}^{N} \frac{(y_i - \hat{y}_i)^2}{\sigma_{y_i}^2} = (Y - X A)^T C^{-1} (Y - X A) \end{equation}
where $C$ is the covariance matrix that contains the variances $\sigma_{y_i}^2$ for each observation:
\begin{equation} C = \begin{bmatrix} \sigma_{y_1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{y_2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \sigma_{y_N}^2 \end{bmatrix} \end{equation}
The solution to the weighted regression problem can be found by solving the following equation:
\begin{equation} A = (X^T C^{-1} X)^{-1} X^T C^{-1} Y \end{equation}
This equation minimizes the chi-squared parameter, taking into account the weights $w_i$.
The matrix of coefficients can also be directly computed using the following formulas:
\begin{equation} A = \begin{bmatrix} b \\ m \end{bmatrix}, \quad b = \frac{\sum w_i x_i^2 \sum w_i y_i - \sum w_i x_i \sum w_i x_i y_i}{\Delta}, \quad m = \frac{\sum w_i \sum w_i x_i y_i - \sum w_i x_i \sum w_i y_i}{\Delta} \end{equation}
where $\Delta$ is the determinant of the matrix:
\begin{equation} \Delta = \sum w_i \sum w_i x_i^2 - \left( \sum w_i x_i \right)^2 \end{equation}
The variance of the residuals in WLS is estimated similarly to OLS:
\begin{equation} \hat{\sigma}^2 = \frac{\varepsilon^T C^{-1} \varepsilon}{N - p} \end{equation}
and the variance of the parameters is computed using the weights as well:
\begin{equation} \text{Var}(\hat{m}) = \frac{\hat{\sigma}^2}{\sum w_i (x_i - \bar{x})^2} \end{equation}
\begin{equation} \text{Var}(\hat{b}) = \frac{\hat{\sigma}^2 \sum w_i x_i^2}{N \sum w_i (x_i - \bar{x})^2} \end{equation}
The covariance matrix of the parameters in WLS is:
\begin{equation} \text{Cov}(\hat{b}, \hat{m}) = \hat{\sigma}^2 (X^T C^{-1} X)^{-1} \end{equation}
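The matrix form of the WLS solution can likewise be checked against the explicit sums. This sketch uses made-up synthetic data with heteroscedastic errors:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
sigma_y = rng.uniform(0.2, 1.0, size=x.size)
y = 2.0 * x + 1.0 + rng.normal(0, sigma_y)

X = np.vstack([np.ones_like(x), x]).T
Cinv = np.diag(1.0 / sigma_y**2)            # C^{-1}: diagonal weights w_i
A = np.linalg.inv(X.T @ Cinv @ X) @ X.T @ Cinv @ y   # matrix form
b_hat, m_hat = A

# the same solution from the explicit sums, with weights w_i = 1/sigma_y_i^2
w = 1.0 / sigma_y**2
Delta = w.sum() * (w * x**2).sum() - (w * x).sum()**2
b_sum = ((w * x**2).sum() * (w * y).sum() - (w * x).sum() * (w * x * y).sum()) / Delta
m_sum = (w.sum() * (w * x * y).sum() - (w * x).sum() * (w * y).sum()) / Delta
print(np.allclose([b_hat, m_hat], [b_sum, m_sum]))   # True
```

The sum formulas are just Cramer's rule applied to the $2 \times 2$ normal equations, so both routes agree to machine precision.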
- The Design Matrix $X$.
In linear regression, the design matrix $X$ is defined as follows:
\begin{equation} X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \end{equation}
Where:
- $N$ is the number of data points.
- $x_i$ is the independent variable value for the $i$-th data point.
- The Matrix $X^T X$.
Now, we compute $X^T X$, the product of the transpose of $X$ with $X$:
\begin{equation} X^T X = \begin{bmatrix} N & \sum_{i=1}^N x_i \\ \sum_{i=1}^N x_i & \sum_{i=1}^N x_i^2 \end{bmatrix} \end{equation}
Where:
- $N$ is the number of data points.
- $\sum_{i=1}^N x_i$ is the sum of the independent variable values.
- $\sum_{i=1}^N x_i^2$ is the sum of the squares of the independent variable values.
- The Inverse of $X^T X$.
To compute the inverse of a $2 \times 2$ matrix, we use the following formula:
\begin{equation} \quad A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \quad A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \end{equation}
For the matrix $X^T X$, applying this formula gives:
\begin{equation} (X^T X)^{-1} = \frac{1}{N \sum_{i=1}^N x_i^2 - \left( \sum_{i=1}^N x_i \right)^2} \begin{bmatrix} \sum_{i=1}^N x_i^2 & -\sum_{i=1}^N x_i \\ -\sum_{i=1}^N x_i & N \end{bmatrix} \end{equation}
- Covariance Matrix of $\hat{b}$ and $\hat{m}$.
The covariance matrix for the parameters $\hat{b}$ and $\hat{m}$ in linear regression is given by:
\begin{equation} \text{Cov}(\hat{b}, \hat{m}) = \hat{\sigma}^2 (X^T X)^{-1} \end{equation}
Substituting the inverse of $X^T X$ into this expression, we get:
\begin{equation} \text{Cov}(\hat{b}, \hat{m}) = \hat{\sigma}^2 \cdot \frac{1}{N \sum_{i=1}^N x_i^2 - \left( \sum_{i=1}^N x_i \right)^2} \begin{bmatrix} \sum_{i=1}^N x_i^2 & -\sum_{i=1}^N x_i \\ -\sum_{i=1}^N x_i & N \end{bmatrix} \end{equation}
- Simplifying the Covariance Matrix.
Now, we simplify the result by expressing the terms in terms of the mean and variance of $x$:
- The mean of $x$, denoted by $\bar{x}$, is given by:
\begin{equation} \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i \end{equation}
- The scatter of $x$ about its mean, $\sum_{i=1}^N (x_i - \bar{x})^2$, satisfies:
\begin{equation} \sum_{i=1}^N (x_i - \bar{x})^2 = \sum_{i=1}^N x_i^2 - \frac{1}{N} \left( \sum_{i=1}^N x_i \right)^2 \end{equation}
Using these definitions, we can simplify the elements of the inverse of $X^T X$ to get the covariance matrix:
\begin{equation} \text{Cov}(\hat{b}, \hat{m}) = \hat{\sigma}^2 \cdot \frac{1}{N \sum_{i=1}^N (x_i - \bar{x})^2} \begin{bmatrix} \sum_{i=1}^N x_i^2 & -\sum_{i=1}^N x_i \\ -\sum_{i=1}^N x_i & N \end{bmatrix} = \hat{\sigma}^2 \begin{bmatrix} \dfrac{\sum_{i=1}^N x_i^2}{N \sum_{i=1}^N (x_i - \bar{x})^2} & - \dfrac{\bar{x}}{\sum_{i=1}^N (x_i - \bar{x})^2} \\ - \dfrac{\bar{x}}{\sum_{i=1}^N (x_i - \bar{x})^2} & \dfrac{1}{\sum_{i=1}^N (x_i - \bar{x})^2} \end{bmatrix} \end{equation} where $\bar{x} = (1/N)\sum^N_{i=1} x_i$.
- Final Covariance Matrix.
This is the covariance matrix provided:
\begin{equation} {\rm Cov} (\hat{b},\hat{m}) = \hat{\sigma}^2 (X^T X)^{-1} = \hat{\sigma}^2 \begin{bmatrix} \dfrac{\sum^N_{i=1} x^2_i}{N \sum^N_{i=1} (x_i - \bar{x})^2} & - \dfrac{\bar{x}}{\sum^N_{i=1}(x_i - \bar{x})^2}\\ - \dfrac{\bar{x}}{\sum^N_{i=1}(x_i - \bar{x})^2} & \dfrac{1}{\sum^N_{i=1}(x_i - \bar{x})^2} \end{bmatrix} \end{equation}
This covariance matrix describes the relationship between the estimated parameters $\hat{b}$ and $\hat{m}$, where the off-diagonal elements are the covariances and the diagonal elements are the variances of the parameters. The matrix shows how the spread in the data influences the uncertainty in the estimated parameters of the linear regression model.
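The simplification can be verified numerically for an arbitrary set of $x$ values (toy numbers below):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
N = x.size
X = np.vstack([np.ones(N), x]).T
sigma2 = 1.7                                 # any positive residual variance

# direct form: sigma^2 (X^T X)^{-1}
direct = sigma2 * np.linalg.inv(X.T @ X)

# simplified form in terms of the mean and scatter of x
Sxx = np.sum((x - x.mean())**2)
simplified = sigma2 * np.array([[np.sum(x**2) / (N * Sxx), -x.mean() / Sxx],
                                [-x.mean() / Sxx, 1.0 / Sxx]])
print(np.allclose(direct, simplified))       # True
```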
Appendix B: $\chi^2$ minimization demonstration¶
To prove that the equation
$$ A = [X^T C^{-1} X]^{-1} [X^T C^{-1} Y] $$
minimizes the chi-squared parameter
$$ \chi^2 = [Y - X A]^T C^{-1} [Y - X A], $$
we follow these steps:
- Expand the expression for $\chi^2$. We know that: $$ \chi^2 = [Y - X A]^T C^{-1} [Y - X A]. $$
Expanding this matrix product: $$ \chi^2 = (Y^T - A^T X^T) C^{-1} (Y - X A). $$
Using distributive properties: $$ \chi^2 = Y^T C^{-1} Y - Y^T C^{-1} X A - A^T X^T C^{-1} Y + A^T X^T C^{-1} X A. $$
Notice that the terms $- Y^T C^{-1} X A$ and $- A^T X^T C^{-1} Y$ are scalars (numbers), and scalars are symmetric. Therefore: $$ \chi^2 = Y^T C^{-1} Y - 2 A^T X^T C^{-1} Y + A^T X^T C^{-1} X A. $$
- Differentiate $\chi^2$ with respect to $A$. To minimize $\chi^2$, we take its derivative with respect to $A$ (the parameter vector) and set it equal to zero. We calculate the derivative term by term:
- The derivative of $Y^T C^{-1} Y$ with respect to $A$ is zero because it does not depend on $A$.
- The derivative of $-2 A^T X^T C^{-1} Y$ is: $$ \frac{\partial}{\partial A} \left(-2 A^T X^T C^{-1} Y\right) = -2 X^T C^{-1} Y. $$
- The derivative of $A^T X^T C^{-1} X A$ is: $$ \frac{\partial}{\partial A} \left(A^T X^T C^{-1} X A\right) = 2 X^T C^{-1} X A. $$
Adding these results: $$ \frac{\partial \chi^2}{\partial A} = -2 X^T C^{-1} Y + 2 X^T C^{-1} X A. $$
- Set the derivative to zero. To minimize $\chi^2$, set the derivative to zero: $$ -2 X^T C^{-1} Y + 2 X^T C^{-1} X A = 0. $$
Divide by 2: $$ X^T C^{-1} X A = X^T C^{-1} Y. $$
- Solve for $A$. Rearranging to solve for $A$: $$ A = [X^T C^{-1} X]^{-1} [X^T C^{-1} Y]. $$
This is the equation we sought to prove.
- Connection to $\chi^2$. This result shows that the values of $A$ calculated using this formula are the ones that minimize the $\chi^2$ parameter. This is because taking the derivative of $\chi^2$ with respect to $A$ and setting it to zero finds the point where the gradient of $\chi^2$ is zero, which corresponds to the minimum of the quadratic function.
The equation: $$ A = [X^T C^{-1} X]^{-1} [X^T C^{-1} Y] $$ minimizes $\chi^2$ because it solves the system of linear equations obtained by differentiating $\chi^2$ with respect to $A$. This procedure guarantees that the parameters $A$ provide the best fit for the data, accounting for the weighting introduced by the covariance matrix $C$.
x = np.array(table_data[r"$x$"])
y = np.array(table_data[r"$y$"])
sigma_y = np.array(table_data[r"$\sigma_y$"])
def model(x, m, b):
    return m * x + b
def chi_squared(params):
    m, b = params
    y_model = model(x, m, b)
    chi2 = np.sum(((y - y_model) / sigma_y) ** 2)
    return chi2
initial_guess = [1, 1]
result = minimize(chi_squared, initial_guess)
m_fit, b_fit = result.x
chi2_min = result.fun
print(f"Slope fitted (m): {m_fit}")
print(f"Intercept fitted (b): {b_fit}")
print(f"Minimum chi-2: {chi2_min}")
y_fit = model(x, m_fit, b_fit)
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
x_column = table_data[r'$x$'].values
y_column = table_data[r'$y$'].values
sigma_y_column = table_data[r'$\sigma_y$'].values
ax.errorbar(x_column, y_column, yerr=sigma_y_column, fmt='o', label='Data', capsize=5, markersize=5)
ax.set_xlabel(r'$x$')
ax.set_ylabel(r'$y$')
plt.plot(x, y_fit, label=f'Fit: $y = {m_fit:.2f}x + {b_fit:.2f}$', color='red', lw=1)
plt.legend()
plt.title(r'Fit by $\chi^2$ minimization')
plt.show()
Slope fitted (m): 1.0767468471881418 Intercept fitted (b): 213.2736258347447 Minimum chi-2: 289.9637227820858
data = [[201,244,47,287,203,58,210,202,198,158,165,201,157,131,166,160,186,125,218,146],
[592,401,583,402,495,173,479,504,510,416,393,442,317,311,400,337,423,334,533,344],
[61,25,38,15,21,15,27,14,30,16,14,25,52,16,34,31,42,26,16,22],
[9, 4, 11, 7, 5, 9, 4, 4, 11, 7, 5, 5, 5, 6, 6, 5, 9, 8, 6, 5],
[-0.84, 0.31, 0.64, -0.27, -0.33, 0.67, -0.02, -0.05, -0.84, -0.69, 0.3, -0.46, -0.03, 0.5, 0.73, -0.52, 0.9, 0.4, -0.78, -0.56]]
columns = [r'$x$', r'$y$', r'$\sigma_y$', r'$\sigma_x$', r'$\rho_{xy}$']
table_data = pd.DataFrame(data).T
table_data.columns = columns
x = np.array(table_data[r"$x$"])
y = np.array(table_data[r"$y$"])
sigma_y = np.array(table_data[r"$\sigma_y$"])
def model(x, m, b):
return m * x + b
chi2_results = []
def chi_squared(params):
m, b = params
y_model = model(x, m, b)
chi2 = np.sum(((y - y_model) / sigma_y) ** 2)
chi2_results.append((chi2, m, b))
return chi2
initial_guess = [1, 1]
result = minimize(chi_squared, initial_guess)
m_fit, b_fit = result.x
chi2_min = result.fun
chi2_array = np.array(chi2_results)
m_range = np.linspace(0, 3, 100)
b_range = np.linspace(0, 300, 100)
chi2_vs_m = []
chi2_vs_b = []
m_fit = result.x[0]
b_fit = result.x[1]
X_matrix = np.vstack([np.ones_like(x), x]).T
C_matrix = np.diag(sigma_y**2)  # Data covariance matrix
cov_matrix = np.linalg.inv(X_matrix.T @ np.linalg.inv(C_matrix) @ X_matrix)
sigma_m = np.sqrt(cov_matrix[1, 1])
sigma_b = np.sqrt(cov_matrix[0, 0])
b_fixed = b_fit
for m in m_range:
y_model = model(x, m, b_fixed)
chi2 = np.sum(((y - y_model) / sigma_y) ** 2)
chi2_vs_m.append(chi2)
m_fixed = m_fit
for b in b_range:
y_model = model(x, m_fixed, b)
chi2 = np.sum(((y - y_model) / sigma_y) ** 2)
chi2_vs_b.append(chi2)
plt.figure(figsize=(12, 5))
plt.rcParams.update({'font.size': 12, 'axes.linewidth': 1, 'axes.edgecolor': 'k'})
plt.rcParams['font.family'] = 'serif'
plt.subplot(1, 2, 1)
plt.plot(m_range, chi2_vs_m, label=rf"$\chi^2$ vs $m$ (Best fit: $m = {m_fit:.2f} \pm {sigma_m:.2f}$)")
plt.xlabel(r"$m$")
plt.ylabel(r"$\chi^2$")
plt.title(r"$\chi^2$ vs $m$")
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(b_range, chi2_vs_b, label=rf"$\chi^2$ vs $b$ (Best fit: $b = {b_fit:.2f} \pm {sigma_b:.2f}$)", color="orange")
plt.xlabel(r"$b$")
plt.ylabel(r"$\chi^2$")
plt.title(r"$\chi^2$ vs $b$")
plt.legend()
plt.tight_layout()
plt.show()
Appendix C: $\sigma$ inferred or observed?¶
The mathematical justification for changing the model such that $\sigma$ is an observed data point rather than a parameter to be inferred can be understood from the formulation of the Bayesian regression model and how the likelihood and posterior distributions change under this assumption.
1. Model with $\sigma$ as an Inferred Parameter:
When $\sigma$ is a parameter to be inferred, the Bayesian regression model is formulated as follows:
1.1. Regression Model:
The relationship between the dependent variable $y$ and the independent variable $x$ is described by a normal distribution, where the model is: \begin{equation} y = b + m x + \epsilon \end{equation} where $\epsilon \sim N(0, \sigma^2)$, and $\sigma$ is the parameter of dispersion that models the noise in the data.
1.2. Likelihood Function:
The likelihood, i.e., the probability of observing the data $y$ given the parameters $b$, $m$, and $\sigma$, is: \begin{equation} p(y | b, m, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y_i - (b + m x_i))^2}{2 \sigma^2} \right) \end{equation}
Here, the probability of the data is the product of individual normal distributions for each observation, where each $y_i$ depends on the parameters $b$ and $m$, but also on the uncertainty $\sigma$ in the measurements.
1.3. Posterior Distribution:
In this model, the posterior distribution $p(b, m, \sigma | y)$ is proportional to the likelihood multiplied by the prior distributions: \begin{equation} p(b, m, \sigma | y) \propto p(y | b, m, \sigma) \cdot p(b) \cdot p(m) \cdot p(\sigma) \end{equation}
Here, the posterior distribution of $\sigma$ allows us to estimate its value and uncertainty, as it is treated as a parameter to be inferred.
2. Model with $\sigma$ as an Observed Data Point:
When $\sigma$ is treated as an observed data point instead of a parameter to be inferred, the model formulation changes as follows:
2.1. Likelihood Function with $\sigma$ Observed:
Now, $\sigma$ is known and fixed, so there is no need to include $\sigma$ in the inference. The likelihood becomes: \begin{equation} p(y | b, m) = \prod_{i=1}^{N} \frac{1}{\sqrt{2 \pi \sigma_i^2}} \exp\left( -\frac{(y_i - (b + m x_i))^2}{2 \sigma_i^2} \right) \end{equation}
This is the same formula as before, but now $\sigma_i$ is not a parameter in the regression model. It is simply a known value used to scale the likelihood.
2.2. Posterior Distribution with $\sigma$ Observed:
The posterior distribution now only depends on the parameters $b$ and $m$: \begin{equation} p(b, m | y) \propto p(y | b, m) \cdot p(b) \cdot p(m) \end{equation}
In this model, there is no posterior distribution for $\sigma$ because it is not a parameter to be inferred.
3. Comparison of Models:
3.1. Uncertainty about $\sigma$:
When $\sigma$ is a parameter to be inferred, the model explicitly accounts for the uncertainty about $\sigma$, which is reflected in the posterior distribution of $\sigma$. This introduces greater flexibility in the model because the model can adapt to different levels of uncertainty in the data (or noise).
When $\sigma$ is observed, there is no uncertainty about $\sigma$, and the model assumes that the value of $\sigma$ is known and constant. This simplifies the model, but also reduces the ability to model the variability of the error, which may not be appropriate if $\sigma$ is not truly constant or well-known.
3.2. Priors and Model Complexity:
When $\sigma$ is a parameter to be inferred, a prior distribution must be specified for $\sigma$. The prior on $\sigma$ captures our beliefs about the magnitude and distribution of the error before observing the data. This flexibility can be useful when $\sigma$ is not known, and we want the model to adjust both the model parameters and the error term.
However, when $\sigma$ is observed, the prior on $\sigma$ is removed because $\sigma$ is not a parameter to be inferred. The model becomes simpler, but it loses the flexibility to account for the uncertainty in $\sigma$.
3.3. Posterior Distribution and Parameters $b$ and $m$:
When $\sigma$ is a parameter, the posterior distributions of $b$ and $m$ are broader because they reflect the uncertainty in the value of $\sigma$. This additional uncertainty in $\sigma$ propagates into the posterior of the regression parameters.
When $\sigma$ is observed, the posterior distributions of $b$ and $m$ are narrower, as there is no need to deal with the uncertainty in $\sigma$. As a result, the estimates of $b$ and $m$ are more precise, but at the cost of not capturing the variability that might be present in $\sigma$.
4. Conclusion:
- Model with $\sigma$ as an inferred parameter: This model accounts for both the regression parameters $b$ and $m$ and the uncertainty in $\sigma$, leading to a more complex posterior distribution that reflects the variability of $\sigma$.
- Model with $\sigma$ as an observed data point: This model simplifies the inference process by treating $\sigma$ as a known value. However, this reduction in model complexity comes at the cost of losing the ability to model the uncertainty in $\sigma$, which can be problematic if $\sigma$ is not well-known or constant.
This change is significant because by making $\sigma$ an observed data point, we reduce the flexibility of the model, but also make it simpler and faster by eliminating one parameter from the inference process. However, this simplification may be inappropriate if $\sigma$ is not known or varies, as the model cannot capture the uncertainty in the errors.
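The effect described in section 3.3 can be illustrated with a toy numerical example (not the notebook's PyMC model): estimating a mean $\mu$ from Gaussian data, comparing the posterior when $\sigma$ is fixed at the sample estimate versus inferred with a flat prior and marginalized over on a grid. All numbers below are synthetic.

```python
import numpy as np

# Toy data: 10 draws from a Gaussian (values are synthetic)
rng = np.random.default_rng(1)
N = 10
y = rng.normal(5.0, 2.0, N)
s = y.std(ddof=1)  # sample estimate used as the "observed" sigma

mu = np.linspace(0.0, 10.0, 500)
dmu = mu[1] - mu[0]

# (a) sigma observed/fixed: Gaussian posterior for mu (flat prior on mu)
logp_fixed = -0.5 * ((y[None, :] - mu[:, None]) ** 2).sum(axis=1) / s**2
p_fixed = np.exp(logp_fixed - logp_fixed.max())
p_fixed /= p_fixed.sum() * dmu

# (b) sigma inferred (flat prior on a grid) and marginalized out
sig = np.linspace(0.5, 8.0, 400)
r2 = ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)  # residual sum per mu
logp = -N * np.log(sig)[None, :] - 0.5 * r2[:, None] / sig[None, :] ** 2
p_joint = np.exp(logp - logp.max())
p_marg = p_joint.sum(axis=1)
p_marg /= p_marg.sum() * dmu

def post_std(p):
    m = (p * mu).sum() * dmu
    return np.sqrt((p * (mu - m) ** 2).sum() * dmu)

# Marginalizing over sigma broadens the posterior for mu
print(post_std(p_fixed), post_std(p_marg))
```

With $\sigma$ fixed the posterior for $\mu$ is Gaussian with width $s/\sqrt{N}$; marginalizing over $\sigma$ turns it into a heavier-tailed Student-$t$-like posterior, so the second printed width is larger, matching the discussion above.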
Appendix D: what other statistics parameters we can obtain?¶
When performing Bayesian inference, several statistical metrics are used to assess the quality of the posterior distribution. For the parameters $\texttt{intercept}$ and $\texttt{slope}$, we have the following:
HDI_3% and HDI_97%:
- HDI stands for Highest Density Interval.
- These two numbers are the lower and upper bounds of the 94% highest density interval: the narrowest interval that contains 94% of the posterior probability mass. Despite the names, they are not in general the 3rd and 97th percentiles; the HDI only coincides with the equal-tailed interval when the distribution is symmetric and unimodal.
- The interval between these two values is the region where 94% of the posterior distribution lies, providing a range for the plausible values of the parameter.
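A minimal sketch of how an HDI can be computed from posterior samples with a sliding-window search over the sorted draws (in practice $\texttt{arviz}$ computes this for you; the method below assumes a unimodal posterior):

```python
import numpy as np

def hdi(samples, prob=0.94):
    """Narrowest interval containing `prob` of the samples (unimodal case)."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.floor(prob * n))   # number of samples inside each candidate interval
    widths = s[k:] - s[: n - k]   # width of every interval holding `prob` of the mass
    i = np.argmin(widths)         # the narrowest candidate is the HDI
    return s[i], s[i + k]

# For a symmetric distribution the HDI matches the central interval:
rng = np.random.default_rng(0)
lo, hi = hdi(rng.normal(0.0, 1.0, 200_000))
print(lo, hi)  # close to (-1.88, 1.88) for a standard normal
```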
MCSE_mean and MCSE_sd:
- MCSE stands for Monte Carlo Standard Error.
- These values give an estimate of the precision of the mean and standard deviation of the posterior samples.
- $\texttt{MCSE\_mean}$ refers to the standard error for the mean of the posterior samples, while $\texttt{MCSE\_sd}$ refers to the standard error for the standard deviation of the posterior samples.
- Smaller MCSE values indicate more precise estimates, while larger values suggest greater uncertainty.
ESS_bulk and ESS_tail:
- ESS stands for Effective Sample Size.
- $\texttt{ESS\_bulk}$ measures the effective sample size in the "bulk" (central portion) of the posterior distribution, while $\texttt{ESS\_tail}$ measures the effective sample size in the "tails" (extreme values) of the distribution.
- These values indicate how independent the posterior samples are. A higher ESS indicates less autocorrelation in the samples and better exploration of the parameter space.
- $\texttt{ESS\_tail}$ is often smaller than $\texttt{ESS\_bulk}$, since the sampler visits the tails of the distribution far less often than the dense central region.
R_hat:
- This is the Gelman-Rubin statistic, which is used to check if the MCMC chains have converged properly.
- An $\texttt{R\_hat}$ value close to 1.0 indicates that the chains have converged, and the obtained parameter estimates are reliable.
- If $\texttt{R\_hat}$ is significantly greater than 1.0, it suggests that the chains have not converged, and the sampling may not have adequately explored the parameter space. In this case, more iterations may be needed.
In summary, these values help assess the quality and precision of the posterior samples in terms of convergence, accuracy of estimates, and reliability of the intervals. If $\texttt{R\_hat}$ is near 1.0, the $\texttt{MCSE}$ values are small, and the $\texttt{ESS}$ values are large, the inference is likely to be reliable.
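As an illustration, the classic Gelman-Rubin statistic can be computed by hand from the within-chain and between-chain variances (note that $\texttt{arviz}$ actually reports a more robust rank-normalized split-$\hat{R}$; this sketch shows the basic idea):

```python
import numpy as np

def gelman_rubin(chains):
    """Classic (non rank-normalized) R-hat; `chains` has shape (n_chains, n_draws)."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
converged = rng.normal(0.0, 1.0, size=(4, 2000))             # 4 well-mixed chains
stuck = converged + np.array([0.0, 0.0, 0.0, 3.0])[:, None]  # one chain offset
print(gelman_rubin(converged))  # close to 1.0
print(gelman_rubin(stuck))      # clearly above 1.1: no convergence
```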
Appendix E: NUTS sampler¶
NUTS is an adaptive Markov Chain Monte Carlo (MCMC) sampling algorithm used for Bayesian inference. It is an extension of the Hamiltonian Monte Carlo (HMC) method that automatically tunes the trajectory length, avoiding the need to hand-tune the number of leapfrog steps (the step size itself is adapted during warm-up). NUTS is designed to sample efficiently from high-dimensional distributions by using gradients of the target log-density to explore the parameter space.
How does NUTS work?
NUTS uses the Hamiltonian dynamics of the system (a physical model) to explore the posterior distribution. The "energy" of the system is governed by both the potential energy (the negative log-probability of the model) and the kinetic energy (based on momenta).
The algorithm uses a numerical method called leapfrog integration to simulate the motion of particles through the parameter space. This technique allows it to take "steps" through the posterior distribution, maintaining detailed balance and ergodicity.
NUTS automatically tunes the trajectory length (i.e., the number of leapfrog steps) it takes for each iteration to explore the posterior. It uses the No-U-Turn criterion, stopping when the trajectory begins to double back on itself, ensuring that it does not waste computation moving in directions that no longer improve the sampling.
Unlike the traditional HMC, NUTS adapts the trajectory length during sampling. If it detects that a trajectory length is too long (leading to wasted computation), it will adjust the trajectory length in future steps.
Key Features of NUTS:
- No manual tuning: One of the biggest advantages of NUTS over standard HMC is that it doesn't require manual tuning of the step size or trajectory length, making it much easier to use in practice.
- Efficient for high dimensions: NUTS is well-suited for high-dimensional posterior distributions, where traditional MCMC algorithms like Metropolis-Hastings may struggle.
- Uses gradient information: NUTS utilizes gradients of the log-likelihood (or potential energy) to move through parameter space, allowing it to more efficiently explore complex posterior distributions.
- Parallelizable: While $\texttt{PyMC}$ itself may not always parallelize the MCMC chains automatically, NUTS is amenable to parallelization in custom setups.
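The leapfrog integration mentioned above can be sketched in a few lines for a one-dimensional standard-normal target, where the potential energy is $U(q) = q^2/2$. The near-conservation of the total energy along the trajectory is what keeps HMC/NUTS acceptance rates high (the step size and step count below are illustrative):

```python
import numpy as np

def grad_U(q):
    # Gradient of the potential U(q) = q^2 / 2 (negative log-density of N(0, 1))
    return q

def leapfrog(q, p, eps, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * eps * grad_U(q)        # half step for the momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                  # full step for the position
        p = p - eps * grad_U(q)          # full step for the momentum
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)        # final half step for the momentum
    return q, p

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2       # potential + kinetic energy

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.01, n_steps=100)
# The total energy is nearly conserved along the simulated trajectory
print(hamiltonian(q0, p0), hamiltonian(q1, p1))
```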
Appendix F: posterior and posterior predictive¶
The posterior is the distribution of the model parameters after observing the data. It is what we want to estimate using Bayes' theorem. In simple terms, it is an update of our beliefs about the model's parameters after seeing the data.
In this notebook, when we sample parameters such as $m$ and $b$ with $\texttt{pm.sample()}$, we obtain the values of the parameters that best explain the observed data, given certain assumptions (the model and prior distributions). These parameters might include something like:
- $m$: the slope of the linear regression
- $b$: the intercept of the regression
The posterior of these parameters shows the probability distributions of the parameters given the data. In other words, it is the "estimation" of the parameter values based on the observed data.
For example, if you have a linear regression and a dataset, the posterior samples of $m$ and $b$ will tell you, given the data, what is the likely range of values for $m$ and $b$.
1. Posterior Predictive: The posterior predictive, on the other hand, is the distribution of future predictions you would make using the model, given the observed data and the sampled parameters from the posterior. In other words:
- The posterior predictive shows how future observations (data) would behave if the parameter values you estimated from the model were applied.
- It is the process of generating new data (simulations) based on what you know about your model and your current data.
It is a simulation of future data based on the parameter distributions you obtained from the posterior. Using the sampled values of $m$ and $b$, you can make predictions for new $y$ values (in the case of linear regression) and see how they vary depending on the posterior parameters.
2. Key Differences:
- Posterior of the parameters: Samples from the model's parameters (such as $m$ and $b$) given the observed data. It tells you what values are likely for the parameters.
- Posterior predictive: Predictions of new data or future outcomes, based on the parameter samples obtained from the posterior. It tells you what values would be reasonable for new observations under the same model.
In Summary:
- The posterior is a distribution over the parameters of the model. It tells you how you estimate the parameters $m$ and $b$ after seeing the data.
- The posterior predictive is a distribution over the predictions of data (i.e., future or simulated values), based on the posterior samples of the parameters.
Simple Example to Clarify:
Posterior: After observing the data from a regression, the parameters $m$ and $b$ might have distributions such as:
- $m \sim N(1.2, 0.05)$ (this means that the slope $m$ has a mean of 1.2 and a standard deviation of 0.05 after observing the data).
- $b \sim N(0.5, 0.1)$ (this means that the intercept $b$ has a mean of 0.5 and a standard deviation of 0.1).
These are the likely values for the parameters given the data.
Posterior Predictive: Then, using those $m$ and $b$ parameters from the posterior, you can generate predictions for new observations (for example, for $x$ values not observed in the data). So, the posterior predictive distribution might give you a range for the possible future $y$ values, given that your parameters are distributed according to the posterior values.
How to Use This Information in Your Plots:
When you use the posterior predictive with $\texttt{az.plot\_ppc()}$, you are plotting those simulations of new data ($y$) using the sampled parameter values from the posterior.
When you look at the posterior of $m$ and $b$, you are seeing how these parameters are distributed and what their most likely values are, given the data.
Posteriors: How likely is each value of the parameters?
Posterior Predictive: How might the future data behave, given what we know about the model's parameters?
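The distinction can be sketched numerically using the illustrative posteriors from the example above, $m \sim N(1.2, 0.05)$ and $b \sim N(0.5, 0.1)$; the observation noise $\sigma = 0.3$ and the new input $x_{\text{new}} = 2$ are assumed values for this demo, not taken from the notebook's fits:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10_000
x_new = 2.0   # a new input where we want a prediction (assumed)
sigma = 0.3   # assumed observation noise for this illustration

# Posterior samples of the parameters (the example distributions above)
m = rng.normal(1.2, 0.05, n_draws)
b = rng.normal(0.5, 0.10, n_draws)

# Posterior of the *mean line* at x_new: parameter uncertainty only
mean_line = m * x_new + b

# Posterior predictive at x_new: parameter uncertainty + observation noise
y_new = mean_line + rng.normal(0.0, sigma, n_draws)

print(mean_line.std(), y_new.std())  # the predictive spread is larger
```

The posterior predictive spread is always at least as wide as the spread of the mean line, because it adds the observation noise on top of the parameter uncertainty.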
Appendix G: pruning outliers¶
The $N+3$ extra parameters $(\{q_i\}^N_{i=1}, P_b, Y_b, V_b)$ are removed through marginalization, so in the end we are left with an inference of just the line parameters $(m, b)$. In general, there is no reason to be concerned just because you have more parameters than data. However, the marginalization will require that we have a measure on our parameters (integrals require measures), and that measure is provided by a prior.
In this case, the likelihood is: \begin{equation} \mathcal{L} = P\left(\{y_i\}^N_{i=1}|m,\,b,\,\{q_i\}^N_{i=1},\,Y_b,\,V_b,\,I\right) \end{equation} \begin{equation} \Rightarrow\:\mathcal{L} = \prod^N_{i=1} \left[p_{\text{fg}}\left(y_i|m,\,b,\,I\right)\right]^{q_i} \left[p_{\text{bg}}\left(y_i|Y_b,\,V_b,\,I\right)\right]^{1 - q_i} \end{equation} \begin{equation} \Rightarrow\:\mathcal{L} = \prod^N_{i=1} \left[\frac{1}{\sqrt{2\pi\sigma^2_i}} \exp{\left(-\frac{(y_i - mx_i - b)^2}{2\sigma^2_i}\right)}\right]^{q_i} \:\left[\frac{1}{\sqrt{2\pi (V_b + \sigma^2_i)}} \exp{\left(-\frac{(y_i - Y_b)^2}{2(V_b + \sigma^2_i)}\right)}\right]^{1-q_i} \end{equation} where $p_{\text{fg}}$ is the model for the good points on a straight line (foreground) and $p_{\text{bg}}$ the model for the outliers (bad points).
Because we are permitting data rejection, there is an important prior probability on the $\{q_i\}^N_{i=1}$ that penalizes each rejection, given by the joint distribution factorization: \begin{equation} P(m, b, \{q_i\}^N_{i=1}, P_b, Y_b, V_b|I) = p(\{q_i\}^N_{i=1} | P_b, I) p(m, b, P_b, Y_b, V_b | I) \end{equation} where: \begin{equation} p(\{q_i\}^N_{i=1} | P_b, I) = \prod_{i=1}^{N} [1 - P_b]^{q_i} P_b^{(1 - q_i)} \end{equation} that is the binomial probability of the particular sequence $\{q_i\}^N_{i=1}$. The priors on the other parameters ($P_b$, $Y_b$, $V_b$) should either be set according to an analysis of your prior knowledge or else be uninformative.
In this way: \begin{equation} P\left(\{y_i\}^N_{i=1}|m, b, \{q_i\}^N_{i=1}, Y_b, V_b,I\right)\:P(m, b, \{q_i\}^N_{i=1}, P_b, Y_b, V_b|I) = \end{equation} \begin{equation} = \left(\prod^N_{i=1} [p_{\text{fg}}]^{q_i} [p_{\text{bg}}]^{1-q_i}\right) \left(\prod^N_{i=1} [1-P_b]^{q_i} [P_b]^{1-q_i}\right) P(m,b,Y_b,V_b|I) \end{equation} \begin{equation} = \prod^N_{i=1} \left[(1-P_b)\,p_{\text{fg}}\right]^{q_i} \left[P_b\,p_{\text{bg}}\right]^{1-q_i} P(m,b,Y_b,V_b|I) \end{equation} which yields: \begin{equation} P\left(m, b, \{q_i\}^N_{i=1}, P_b, Y_b, V_b|\{y_i\}^N_{i=1},I\right) = \dfrac{\prod^N_{i=1} \left[(1-P_b)\,p_{\text{fg}}\right]^{q_i} \left[P_b\,p_{\text{bg}}\right]^{1-q_i} P(m,b,Y_b,V_b|I)}{P(\{y_i\}^N_{i=1}|I)} \end{equation}
We use a Gaussian model because it is the maximum-entropy distribution for a given mean and variance; it is, in some sense, the least restrictive assumption, given that we have allowed that mean and variance to vary in the fit.
So, the marginalization for $m$, $b$, looks like: \begin{equation} P(m,b|\{y_i\}^N_{i=1}, I) = \int d\{q_i\}^N_{i=1} \: dP_b \: dY_b \: dV_b \: P\left(m, b, \{q_i\}^N_{i=1}, P_b, Y_b, V_b|\{y_i\}^N_{i=1},I\right) \end{equation} which means a sum over all of the $2^N$ possible settings of the $\{q_i\}^N_{i=1}$, with the integrals over the other parameters taken over their entire prior support. Effectively, then, this marginalization involves evaluating $2^N$ different likelihoods and marginalizing each one of them over the parameters ($P_b$, $Y_b$, $V_b$). Alternatively, we can marginalize over each individual $q_i$ analytically, obtaining the more tractable mixture model we use here.
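Marginalizing each $q_i$ individually turns the likelihood into a per-point two-component mixture, $\sum_i \log\left[(1-P_b)\,p_{\text{fg}}(y_i) + P_b\,p_{\text{bg}}(y_i)\right]$. A minimal sketch of evaluating it (the data and parameter values below are made up, with one gross outlier):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_loglike(params, x, y, sigma_y):
    """Log-likelihood with each q_i marginalized out of the mixture model."""
    m, b, P_b, Y_b, V_b = params
    # Foreground: straight line with per-point Gaussian errors
    log_fg = (np.log(1 - P_b)
              - 0.5 * np.log(2 * np.pi * sigma_y**2)
              - 0.5 * (y - (m * x + b)) ** 2 / sigma_y**2)
    # Background: broad Gaussian that absorbs the outliers
    var_bg = V_b + sigma_y**2
    log_bg = (np.log(P_b)
              - 0.5 * np.log(2 * np.pi * var_bg)
              - 0.5 * (y - Y_b) ** 2 / var_bg)
    # Sum over points of log[(1 - P_b) p_fg + P_b p_bg], done stably in log space
    return logsumexp([log_fg, log_bg], axis=0).sum()

# Synthetic data close to y = 2x, with one gross outlier (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 30.0])
sigma_y = np.full(5, 0.3)

ll = mixture_loglike([2.0, 0.0, 0.1, 10.0, 100.0], x, y, sigma_y)
print(ll)
```

The outlier at $x=5$ contributes a modest penalty through the background component instead of the enormous $\chi^2$ penalty it would incur under the pure straight-line likelihood.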
Appendix H: $\sigma$ for $m$ and $b$ from $\theta$ and $b_t$¶
We are interested in calculating the uncertainties of the parameters $m$ (slope) and $b$ (intercept) from the uncertainties in the parameters $\theta$ and $b_t$. The transformation of the parameters follows from the geometric interpretation of the line equation.
1. Definition of parameters:
We start with the following relationships: the equation of the line is given by $y = mx + b$, where $m$ is the slope and $b$ is the intercept, the slope $m$ is related to the angle $\theta$ by $m = \tan(\theta)$, and the intercept $b$ is related to $b_t$ by $b = b_t \cos(\theta)$, where $b_t$ is the perpendicular distance from the origin to the line.
2. Uncertainties of parameters:
Given the uncertainties of $\theta$ and $b_t$, we want to compute the uncertainties in $m$ and $b$. To calculate the uncertainty in the slope $m$, we need to differentiate $m = \tan(\theta)$ with respect to $\theta$: \begin{equation} \sigma_m = \left| \frac{dm}{d\theta} \right| \sigma_{\theta} = \left| \sec^2(\theta) \right| \sigma_{\theta} = \frac{\sigma_{\theta}}{\cos^2(\theta)} \end{equation}
Thus, the uncertainty in the slope is: \begin{equation} \sigma_m = \frac{\sigma_{\theta}}{\cos^2(\theta)} \end{equation}
Next, we calculate the uncertainty in the intercept $b = b_t \cos(\theta)$. Since the uncertainties in $b_t$ and $\theta$ are independent, we propagate them in quadrature: \begin{equation} \sigma_b = \sqrt{\left( \frac{\partial b}{\partial b_t} \right)^2 \sigma_{b_t}^2 + \left( \frac{\partial b}{\partial \theta} \right)^2 \sigma_{\theta}^2} \end{equation}
We first calculate the partial derivative of $b$ with respect to $b_t$: \begin{equation} \frac{\partial b}{\partial b_t} = \cos(\theta) \end{equation}
Then, we calculate the partial derivative of $b$ with respect to $\theta$: \begin{equation} \frac{\partial b}{\partial \theta} = -b_t \sin(\theta) \end{equation}
Thus, the uncertainty in the intercept is: \begin{equation} \sigma_b = \sqrt{\cos^2(\theta)\,\sigma_{b_t}^2 + b_t^2 \sin^2(\theta)\,\sigma_{\theta}^2} \end{equation}
3. Final results:
Now that we have the expressions for the uncertainties in $m$ and $b$, we can summarize the results:
- The slope $m$ is given by $m = \tan(\theta)$, and its uncertainty is $\sigma_m = \frac{\sigma_{\theta}}{\cos^2(\theta)}$.
- The intercept $b$ is given by $b = b_t \cos(\theta)$, and its uncertainty is $\sigma_b = \sqrt{\cos^2(\theta)\,\sigma_{b_t}^2 + b_t^2 \sin^2(\theta)\,\sigma_{\theta}^2}$.
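These first-order formulas, assuming independent Gaussian uncertainties on $\theta$ and $b_t$ propagated in quadrature, can be cross-checked with a quick Monte Carlo (the input values below are made up for illustration):

```python
import numpy as np

def sigma_mb(theta, b_t, sigma_theta, sigma_bt):
    """First-order (quadrature) propagation from (theta, b_t) to (m, b)."""
    sigma_m = sigma_theta / np.cos(theta) ** 2
    sigma_b = np.sqrt(np.cos(theta) ** 2 * sigma_bt**2
                      + b_t**2 * np.sin(theta) ** 2 * sigma_theta**2)
    return sigma_m, sigma_b

# Made-up values for illustration
theta0, bt0 = 0.5, 2.0
s_theta, s_bt = 0.01, 0.02

sm, sb = sigma_mb(theta0, bt0, s_theta, s_bt)

# Monte Carlo cross-check of the linearized formulas
rng = np.random.default_rng(0)
th = rng.normal(theta0, s_theta, 200_000)
bt = rng.normal(bt0, s_bt, 200_000)
print(sm, np.tan(th).std())          # should agree closely
print(sb, (bt * np.cos(th)).std())   # should agree closely
```

For small input uncertainties the linearization is accurate; for large $\sigma_\theta$ near $\theta = \pi/2$ the $\tan$ transformation becomes strongly nonlinear and the Monte Carlo spread is the more trustworthy estimate.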
import matplotlib
print(f"Running on matplotlib v{matplotlib.__version__}")
Running on matplotlib v3.8.2